
I have been using a self-hosted Redlib as an alternative front-end for Reddit. I like Redlib better than the original UI, as it is quite simple and doesn’t bother me with any popups.

My instance is not private, so in theory, everyone can use it. As I do not advertise my instance, the traffic on my server was negligible.

However, lately AI scrapers seem to have discovered my instance. That has led to my instance being rate-limited.

One of the suggested solutions to this problem is using Anubis as a firewall to discourage scrapers from abusing your website.

Anubis is a Web AI Firewall Utility that weighs the soul of your connection using one or more challenges in order to protect upstream resources from scraper bots. This program is designed to help protect the small internet from the endless storm of requests that flood in from AI companies. Anubis is as lightweight as possible to ensure that everyone can afford to protect the communities closest to them.

Set up Anubis with Docker Compose

I use Traefik with Docker Swarm to deploy to a VPS.

Adding Anubis was straightforward and very similar to the docs:

version: '3.8'
services:
  anubis:
    image: ghcr.io/techarohq/anubis:latest
    environment:
      BIND: ":8080"
      DIFFICULTY: "5"
      METRICS_BIND: ":9090"
      SERVE_ROBOTS_TXT: "true"
      TARGET: "http://redlib:8080"
      POLICY_FNAME: "/data/cfg/botPolicy.yaml"
      OG_PASSTHROUGH: "true"
      OG_EXPIRY_TIME: "24h"
    volumes:
      - "/home/$SERVER_USER/anubis/botPolicy.yaml:/data/cfg/botPolicy.yaml:ro"
    deploy:
      labels:
        - traefik.enable=true
        - traefik.constraint-label=public
        - traefik.http.routers.anubis.entrypoints=websecure
        - traefik.http.routers.anubis.rule=Host(`<your URL>`)
        - traefik.http.services.anubis.loadbalancer.server.port=8080
        # other labels, e.g. tls options, rate-limiting, etc.
    networks:
      - public
    depends_on:
      - redlib

  redlib:
    container_name: redlib
    image: quay.io/redlib/redlib
    environment:
      REDLIB_ROBOTS_DISABLE_INDEXING: "on" # quoted, otherwise YAML parses bare `on` as a boolean
    # No Traefik labels - accessed internally through Anubis only
    networks:
      - public

networks:
  public:
    external: true

Now Redlib is no longer reachable directly: it has no Traefik labels of its own, so Traefik routes the public URL to Anubis, and Anubis proxies vetted requests on to Redlib over the internal network.
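The DIFFICULTY value above controls how expensive the proof-of-work challenge is: the visitor's browser must find a nonce whose SHA-256 hash begins with that many zero hex digits. A minimal Python sketch of the idea (Anubis's actual challenge format and client-side implementation differ):

```python
import hashlib
import itertools

def solve(challenge: str, difficulty: int) -> int:
    # Brute-force a nonce until sha256(challenge + nonce) starts with
    # `difficulty` leading zero hex digits. Each extra digit multiplies
    # the expected work by 16, so DIFFICULTY=5 costs roughly a million
    # hashes: trivial for one human visitor, costly at scraper scale.
    target = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce
```

Verifying a solution on the server side is a single hash, which is what makes this asymmetry useful as a filter.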

Configuration via botPolicy.yaml

The documentation has a minimal configuration example that you can use:

bots:
  - name: cloudflare-workers
    headers_regex:
      CF-Worker: .*
    action: DENY
  - name: well-known
    path_regex: ^/\.well-known/.*$
    action: ALLOW
  - name: favicon
    path_regex: ^/favicon\.ico$
    action: ALLOW
  - name: robots-txt
    path_regex: ^/robots\.txt$
    action: ALLOW
  - name: generic-browser
    user_agent_regex: Mozilla
    action: CHALLENGE

Alternatively, you can download the default policy and comment/uncomment what’s needed.
I found the configuration a bit confusing, to be honest.
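One way to make the path rules less confusing is to test them locally. The snippet below checks which rule a request path hits first, using Python's `re` (close enough to the RE2 syntax Go uses) and the dots escaped as regex metacharacters should be; the top-to-bottom, first-match evaluation order is my assumption here, not something the docs guarantee:

```python
import re

# Path-based rules from the policy above, in declaration order.
rules = [
    ("well-known", r"^/\.well-known/.*$", "ALLOW"),
    ("favicon",    r"^/favicon\.ico$",    "ALLOW"),
    ("robots-txt", r"^/robots\.txt$",     "ALLOW"),
]

def first_match(path):
    # Return the first rule that matches, else fall through to the
    # generic browser challenge.
    for name, pattern, action in rules:
        if re.search(pattern, path):
            return name, action
    return None, "CHALLENGE"

print(first_match("/robots.txt"))    # ('robots-txt', 'ALLOW')
print(first_match("/r/selfhosted"))  # (None, 'CHALLENGE')
```

Note that an unescaped dot (as in `^/robots.txt$`) would also match paths like `/robotsxtxt`, which is why the escaped form is safer.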

If you use Docker Compose/Docker Swarm, make sure to copy the botPolicy.yaml file to your VPS and bind-mount it (as in the volumes section above) so it is readable inside the Anubis container.
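In practice that boils down to something like the following, assuming the paths from the compose file above and a stack named redlib (both are my placeholders, adjust to your setup):

```shell
# Copy the policy to the host path referenced by the bind mount.
scp botPolicy.yaml $SERVER_USER@your-vps:/home/$SERVER_USER/anubis/botPolicy.yaml

# Deploy (or redeploy) the stack so the container picks it up.
docker stack deploy -c docker-compose.yml redlib
```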