Are there some recommended lists for robots.txt? I only want to allow robots that benefit me or the public (e.g. search engines, universities) but block everything that only crawls for its own benefit (e.g. ChatGPT, archive websites).
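Something like this is what I had in mind: default-allow, with the AI crawlers named explicitly. Sketch only, and the user-agent tokens below (GPTBot, ChatGPT-User, CCBot, Bytespider, Google-Extended) are just the publicly documented ones I know of; these lists go stale fast, so check a maintained one before relying on it.

```
# Block documented AI/data-mining crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Google-Extended
Disallow: /

# Everyone else (search engines etc.) is fine
User-agent: *
Allow: /
```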
we're archivists b***h we don't give a FRICK about robots.txt.
Well, obviously I can't enforce it, but most corporate bots follow the instructions. I don't want companies like OpenAI to crawl my site; they can go frick themselves.
>most corporate bots follow the instructions
>most corporate bots follow the instructions
>most corporate bots follow the instructions
you can't seriously believe this anon?
the only bots following robots.txt are the ones that you actually want to index your site
the others don't give a shit and spoof their user-agent in the first place
>most corporate bots follow the instructions
>robots.txt
heh
Disallow: /
>YEA.. UHM AH AHELLO DEAREST INDEXER BOTTERINOS..!!
>PLS DO !!NOT!! INDEX THESES SPECIFIC RESOURCES FROM MY WEBPAGE!!
>ITS HECKIN PRIVATERINO!! ... SO JUST IGNORE NOTHINNG TO SEE HERE!!
This is what IQfyacas actually believe.
You fricking Black person learn how to read. I don't want to suck corporate wiener and give them free money if it doesn't benefit me. I don't care about you cringe script kiddies, cloudflare will take care of them.
Holy frick you are moronic.
you should have edited it to wink at the end
i hear chinese crawlers in particular will literally rape your site. just a heads up.
Yeah. Bytedance doesn't give a frick about your robots.txt, and they damn near DDoS your site.
>bytedance
What a shitty name. It sounds like the startup names me and my friend would come up with when we were 16 and thought we were geniuses.
i don't remember if it was bytedance but i remember it was some chinese shit. i've been told it was partially my fault for not protecting my site enough / setting it up correctly and that MAYBE played a role. but holy fricking shit. i had to pull the plug for a bit.
did you really have to pull the plug?
How do you know it's bytedance?
Yes they do. My site is literally 99% traffic from chinese bots.
Just block SYN packets with a TTL higher than 128; that takes out most phones, which also takes out most bot farms. Also disable IPv6, then block any SYN packets with an MSS other than 1460. That also eliminates some VPN users.
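In iptables terms it's something like this. Sketch only: assumes root on a Linux box with a default-accept INPUT policy, and the TTL/MSS numbers are the heuristics above, not gospel. Test before deploying or you'll lock out real users.

```shell
# Drop inbound SYNs with a TTL above 128 (heuristic: cuts most phones / bot farms)
iptables -A INPUT -p tcp --syn -m ttl --ttl-gt 128 -j DROP

# Drop SYNs whose MSS is anything other than 1460 (heuristic: cuts some VPNs)
iptables -A INPUT -p tcp --syn -m tcpmss ! --mss 1460 -j DROP

# Disable IPv6 entirely
sysctl -w net.ipv6.conf.all.disable_ipv6=1
sysctl -w net.ipv6.conf.default.disable_ipv6=1
```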
It's just some small site on a cheap web host; I can't change anything. I'm not complaining as long as the site still works, it's just sad seeing that 99% of the traffic comes from bots. No wonder big sites all use Cloudflare these days.
Cloudflare AFAIK does not have controls to do what I suggest, whereas this can be done on any cheap little VM. They take the more expensive approach of trying to really know who is a bot and who is not, and they get it wrong a lot. I take the fascist approach of just blocking phones and have no regrets.
No, but lots of crawlers helpfully put their names in the user agent. Maybe you can dynamically serve a robots.txt depending on that.
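e.g. with nginx, picking a robots.txt variant off the declared user-agent. Sketch only: the file names, the directory, and the UA regex are made up, so swap in whatever crawlers you care about.

```nginx
# In the http context: map the user-agent to a robots.txt variant
map $http_user_agent $robots_variant {
    default                        "robots-default.txt";
    "~*(GPTBot|CCBot|Bytespider)"  "robots-strict.txt";
}

server {
    listen 80;
    location = /robots.txt {
        root /var/www/robots;            # hypothetical directory
        try_files /$robots_variant =404;
    }
}
```

Of course this only changes what the bot is told, not what it can fetch; anything that spoofs its UA or ignores robots.txt sails right past.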
Didn't someone already do that and make a list that I can use?
Ah, so you're looking to host your site on the sickdarknet.
it's the first thing i look at for when i want to download super secret hacker stuff
Any equivalent .txt file I can add to stop black people using my site?
Blackbots.txt
IP block Africa and the United States
I completely block morons without HTTP/2. Don't care, not my problem.
the fact that you are advanced enough in your quest to host a publicly available web service to worry about it being crawled, yet don't seem to understand that the only way to prevent it is to not have your website be publicly available, is disheartening and frankly saddening.