CodeNewbie Community 🌱

Sharon428931
Stop AI Scrapers Cold: SafeLine WAF Rules for Self-Hosted Git Repos

AI-powered crawlers are getting smarter—and they’re not shy about grabbing open code from public and self-hosted Git repositories. If you run a private Git server (Gitea, GitLab CE, etc.) in your homelab or company network, you’ve probably seen unusual requests in your logs. To keep your intellectual property safe, SafeLine WAF provides a Git-focused ruleset to block suspicious AI crawlers and automated tools.


Why Do You Need This?

  • Code theft and scraping: AI models train on anything they can find. Your Git repo is no exception.
  • Increased traffic noise: Automated tools waste resources and can lead to performance issues.
  • Security risks: Bots may attempt brute-force access or map your repo structure.

This SafeLine ruleset specifically targets AI bots and generic scrapers discovered in real-world traffic. It’s tuned for SafeLine 7.3.0 and above.


What’s in the Ruleset?

Whitelist: None (you can add your own trusted clients).

Blacklist (examples):

  1. Block automation tools – catches common HTTP clients used for automated cloning and requests.
  2. Block AI crawlers – stops User-Agents identified as AI-related scrapers.
  3. Block unknown scrapers – covers suspicious sources with no clear identification.
  4. Block missing User-Agent – many bots don’t send a User-Agent header at all.
  5. Block any User-Agent containing “Bot” – the regex [Bb]ot picks up common naming patterns.

With these patterns in place, most suspicious or automated requests are denied before they ever touch your Git repos.
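To get a feel for how these User-Agent checks combine, here’s a minimal Python sketch. This is not SafeLine code — the regexes are illustrative approximations of the rules described above:

```python
import re
from typing import Optional

# Illustrative patterns approximating the blacklist rules above.
BLOCK_PATTERNS = [
    re.compile(r"curl|wget|python|go-http-client", re.IGNORECASE),  # automation tools
    re.compile(r"AI|GPT|LLM"),   # AI crawlers (kept case-sensitive to avoid words like "email")
    re.compile(r"[Bb]ot"),       # generic bot naming pattern
]

def should_block(user_agent: Optional[str]) -> bool:
    """Return True if a request should be denied based on its User-Agent."""
    if not user_agent:  # missing or empty User-Agent header
        return True
    return any(p.search(user_agent) for p in BLOCK_PATTERNS)
```

For example, `should_block("curl/8.4.0")` and `should_block(None)` both return True, while a normal browser User-Agent passes through. Keeping the AI pattern case-sensitive is deliberate: a case-insensitive `ai` would match harmless strings inside ordinary words.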


How to Configure It in SafeLine

Here’s a simplified YAML-style example to show how you could set up these rules:

rules:
  - name: Block missing UA
    match: Header.User-Agent == null
    action: deny

  - name: Block AI crawlers
    match: Header.User-Agent matches "AI|GPT|LLM"
    action: deny

  - name: Block known automation tools
    match: Header.User-Agent matches "curl|wget|python|go-http-client"
    action: deny

  - name: Block unknown sources
    match: # composite conditions for unidentified clients (site-specific)
    action: deny

  - name: Block UA containing Bot
    match: Header.User-Agent matches "[Bb]ot"
    action: deny

Tip: Always test in a staging environment before applying to production. Overly aggressive filters may block legitimate developers or CI/CD pipelines.
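One way to keep legitimate traffic flowing is to add allow rules ahead of the deny rules, in the same style as above. The User-Agent strings below are assumptions — check your own access logs for the exact values your Git clients and CI runners send:

rules:
  - name: Allow real Git clients
    match: Header.User-Agent matches "^git/"
    action: allow

  - name: Allow CI runners
    match: Header.User-Agent matches "gitlab-runner|Jenkins"
    action: allow

Note that the second rule would be defeated by the generic [Bb]ot blacklist pattern if allow rules didn’t take precedence — order matters.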


When to Use It?

  • Self-hosted Git (homelab or SMB) – Protect your personal or internal code.
  • Public mirrors with private components – Stop indexers and scrapers.
  • IP-restricted repos – Adds an extra layer of header-based filtering.

Join the SafeLine Community

If you run into issues or want to share your own rules, join the SafeLine community or contact SafeLine support for further assistance.
