We cache your old /robots.txt for up to 24 hours, to avoid spamming your server with new requests for it every time we want to check a page
This might make sense as a default, but it would be worth honoring the Cache-Control headers served with robots.txt, since some operators want to react to new crawlers faster than 24h. I would still clamp the refresh interval (maybe between 5min and 7d), and the 24h default sounds reasonable when the site doesn't indicate a good value, but it would be nice to follow the site's expressed wishes.
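A minimal sketch of what that clamping could look like (the function name and the exact bounds are my own, not anything Clew actually does):

```python
import re

# Assumed policy: honor the server's Cache-Control max-age for robots.txt,
# but clamp it between 5 minutes and 7 days; fall back to 24 hours.
MIN_TTL = 5 * 60          # 5 minutes
MAX_TTL = 7 * 24 * 3600   # 7 days
DEFAULT_TTL = 24 * 3600   # 24 hours

def robots_txt_ttl(cache_control):
    """Return how many seconds to cache a robots.txt response."""
    if cache_control:
        m = re.search(r"max-age\s*=\s*(\d+)", cache_control)
        if m:
            # Clamp the site's expressed wish into a sane range.
            return max(MIN_TTL, min(int(m.group(1)), MAX_TTL))
    # No usable directive: fall back to the crawler's default.
    return DEFAULT_TTL
```

So a site sending `Cache-Control: max-age=60` would still be refreshed at most every 5 minutes, while one sending no header keeps the 24-hour default.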
I found it interesting that they maintain a list of domains owned by big media corporations[1] (at the moment it seems to be mainly Disney, which goes to show just how big that conglomerate is).
They also rank sites by how sustainable they are[2] (Lobste.rs gets an A+!).
It also supports DuckDuckGo-like bangs[3].
[1] https://codeberg.org/Clew/big-media-domains/src/branch/main/listings.toml
[2] https://clew.se/green/
[3] https://clew.se/bangs/
I really like indieweb projects
https://marginalia-search.com/ is also a good search engine
One more to the bucket: https://mwmbl.org
https://wiby.me