Reverse Proxy to ENFORCE the robots.txt against malicious crawlers that don't respect it
forest 9623b29ea6 Update 'ReadMe.md' 4 months ago
.gitignore refactor to build inside docker container. 4 months ago
Dockerfile asd 4 months ago
ReadMe.md Update 'ReadMe.md' 4 months ago
bible.txt add unblock instructions to readme 4 months ago
config.json fix the config :X 4 months ago
go.mod move to forest git server real quick 4 months ago
go.sum move to forest git server real quick 4 months ago
main.go re-enable tarpit for testing. 4 months ago

ReadMe.md

forgejo-crawler-blocker

What does a GPT-training web crawler see when it tries to access our Forgejo instance and look at every single file at every single commit, ignoring robots.txt and sending a generic User-Agent header? Here is a preview:

https://git.cyberia.club/bible.txt

Yep, that's right. The entire Christian Bible (4 MB), at about 100 bytes per second.

maintenance

If anyone needs to clear the data to unblock someone, these are the commands to run on paimon:

sudo -i

docker stop gitea_forgejo-crawler-blocker
rm /etc/docker-compose/gitea/forgejo-crawler-blocker/traffic.db
docker start gitea_forgejo-crawler-blocker

persistent data storage

Persistent data lives in /forgejo-crawler-blocker/data inside the Docker container.

forest's manual build process

Run on the server (paimon):

cd /home/forest/forgejo-crawler-blocker && git pull sequentialread main && \
  cd /etc/docker-compose/gitea && \
  docker stop gitea_forgejo-crawler-blocker_1 || true && \
  docker rm gitea_forgejo-crawler-blocker_1 || true && \
  docker image rm gitea_forgejo-crawler-blocker || true && \
  rm -f forgejo-crawler-blocker/traffic.db && \
  docker-compose up -d && sleep 1 && \
  docker logs -n 1000 -f gitea_forgejo-crawler-blocker_1