'AI' Crawlers Hammering Git Repos Across the Web – A Rate Limiting Approach

TLDR

AI-powered crawlers were aggressively scraping our Git repositories, causing server overload. We initially scaled up but later implemented fine-tuned Nginx rate limiting using Lua. The solution drastically reduced load while keeping the web frontend fully accessible.

Introduction

If you host a public code repository on the web, you’ve likely faced issues in recent months. AI-powered crawlers are aggressively scraping public repositories, collecting any code they can access. This growing concern has sparked discussions across the tech community. For reference, see this Hacker News thread and this blog post.

I help maintain some large open-source Git repositories and ran into this issue as well. We spent a lot of time tweaking robots.txt, applying per-IP rate limits, and even blocking entire IP ranges from certain ASNs, yet the crawlers kept hammering the servers and causing excessive load. We use cgit to host large code repositories, and even with caching layers such as Varnish and cgit’s internal cache, that wasn’t enough: some operations are inherently resource-intensive, and the relentless crawler requests were consuming all available system resources.

Note: We don’t use a WAF system or any CAPTCHA for our services. We prefer to keep things simple, without external systems hooked into our users’ browsers, relying only on open-source tools.

Identifying the Threat

After reviewing 24 hours of logs, here’s what we found:

  • ~145k requests
  • ~140k unique IP addresses (IPv4 and IPv6)

A simple per-IP rate limit is ineffective in this case: the crawlers rotate through a vast pool of IPs to evade traditional blocks and limits.
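
As a rough sketch of how numbers like these can be pulled from the access log (the log path and the standard combined format are assumptions about the setup):

# total requests in the sampled 24 hours
wc -l < /var/log/nginx/access.log

# unique client addresses, IPv4 and IPv6
awk '{print $1}' /var/log/nginx/access.log | sort -u | wc -l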

From the sampled data, the IP addresses belonged to telecom providers worldwide. This suggests that these crawlers are using residential proxies for their scraping. Unfortunately, many companies offer this service, making it even harder to block them effectively.

First Big Action

Deploy More Servers! 🚀

Since we use gdnsd in our infrastructure, launching additional servers to distribute the load globally was straightforward.

We deployed five additional high-performance servers with 48 to 96 threads each, ensuring cgit had more CPU resources and processes to handle the load efficiently.

It worked for a few weeks. The servers handled a high load, but we were able to serve users without issues—until the crawlers ramped up their request rates again.

[Graph: cgit server load before the rate limiting (cgit-last-graph-before)]

Note: At this point, we gained another insight—most of the crawlers were coming from South America, Europe, and Asia, while servers in the United States were running smoothly.

Fine-Tuned Nginx Rate Limiting

After analyzing the logs again, we identified that the most frequently requested URIs by crawlers were:

/$repo/(log|plain)/XXXYYY?id=<commit-hash>

Regular users frequently access /log, and sometimes /plain, but requests to /log or /plain with a specific commit hash are far less common. This led us to the idea of applying rate limiting specifically when a commit hash is included in the request.
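
For illustration, with a made-up repository path and commit hash, the distinction looks like this:

/repo-name/log/src/file.c                        -> regular browsing, passed through as usual
/repo-name/log/src/file.c?id=0123456789abcdef    -> carries a commit hash, rate limited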

To implement a more complex rate limit that applies only to specific requests, we needed to use access_by_lua_block. This allowed us to apply custom logic based on Nginx location directives with specific parameters, ensuring that only targeted requests were affected.

However, to use this feature, Lua support must be enabled in Nginx, either by compiling it with the lua-nginx-module or using a pre-built package that includes Lua support.
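
As a rough sketch, assuming a packaged dynamic-module build (Debian and Ubuntu ship it as libnginx-mod-http-lua, and OpenResty bundles Lua support out of the box; exact module paths vary), the modules are loaded near the top of nginx.conf:

# ngx_devel_kit must be loaded before the Lua module
load_module modules/ndk_http_module.so;
load_module modules/ngx_http_lua_module.so;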

The Rate Limit

limit_conn_status 429;
limit_req_status 429;
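# the zone key is $host, so the budget below is shared per virtual host,
# not per client IP (per-IP keys are useless against rotating residential proxies)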
limit_req_zone $host zone=source_perserver:10m rate=20r/m;

server {
  [...]

It’s not a rate limit of twenty requests per minute per IP address, but rather per virtual host, i.e. the hostname users are accessing.

location ~ ^/repo-name/(log|plain)/ {

  access_by_lua_block {
    local args = ngx.req.get_uri_args()
    if args["id"] then
      --ngx.log(ngx.ERR, "Debug - Redirecting to /limit-aggressive for rate limiting: " .. ngx.var.request_uri)
      return ngx.exec("/limit-aggressive")
    end
  }

  # Continue routing regular users through Varnish
  proxy_pass  http://varnish;
  proxy_redirect default;
}

location /limit-aggressive {
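    # requests carrying an id= commit hash are rerouted here by ngx.exec();
    # they share the per-host 20 r/m budget: burst=20 queues short spikes,
    # anything beyond that is answered with 429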
    limit_req zone=source_perserver burst=20;
    proxy_pass  http://varnish;
    proxy_redirect default;
}

And then, the load instantly decreased, as shown in the following graph.

[Graph: cgit server load after the rate limiting (cgit-last-graph-after)]

Most importantly, users can still access the web frontend for the Git repositories without any issues.
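
As an outside sanity check, firing a few dozen concurrent requests at a commit-hash URL should start returning 429s once the burst queue fills, while the queued requests trickle through at the configured rate. A minimal sketch, using a placeholder hostname, repository, and hash:

for i in $(seq 1 40); do
  curl -s -o /dev/null -w '%{http_code}\n' \
    'https://git.example.org/repo-name/log/?id=0123456789abcdef' &
done
wait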

To see the rate limiting in action, use tail on the logs and look for 429 responses.
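
Something along these lines works, assuming the default access log location for your setup:

tail -f /var/log/nginx/access.log | grep ' 429 '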

Final Thoughts

To the ‘AI’ crawlers out there—whatever you may be—please teach your systems to use git clone when discovering a repository. It’s a simpler and more efficient approach for everyone. [insert khabi meme here]

– dbaio

References & Further Reading