Just now, I took a look at the HTTP logs on git.sr.ht. Of the past 100,000 HTTP requests received by git.sr.ht (representing about 2½ hours of logs), 4,774 have been requested by GoModuleProxy — 5% of all traffic. And their requests are not cheap: every one is a complete git clone. They come in bursts, so every few minutes we get a big spike from Go, along with a constant murmur of Go traffic.
This has been ongoing since around the release of Go 1.16, which came with some changes to how Go uses modules. Since this release, following a gradual ramp-up in traffic as the release was rolled out to users, git.sr.ht has had a constant floor of I/O and network load for which the majority can be attributed to Go.
I started to suspect that something strange was going on when our I/O alarms started going off in February 2021 (we eventually had to tune these alarms up above the floor of I/O noise generated by Go), correlated with lots of activity from a Go user agent. I was able to narrow it down with some effort, but to the credit of the Go team they did change their User-Agent to make more apparent what was going on. Ultimately, this proved to be the end of the Go team’s helpfulness in this matter.
I did narrow it down: it turns out that the Go Module Mirror runs some crawlers that periodically clone Git repositories with Go modules in them to check for updates. Once we had narrowed this down, I filed a second ticket to address the problem.
I came to understand that the design of this feature is questionable. For a start, I never really appreciated the fact that Go secretly calls home to Google to fetch modules through a proxy (you can set GOPROXY=direct to fix this). Even taking the utility at face value, however, the implementation leaves much to be desired. The service is distributed across many nodes which all crawl modules independently of one another, resulting in very redundant git traffic.
This is a sample from a larger set which shows the hashes of git repositories on the right (names were hashed for privacy reasons), and the number of times they were cloned over the course of an hour. The main culprit is the fact that the nodes all crawl independently and don’t communicate with each other, but the per-node stats are not great either: each IP address still clones the same repositories 8-10 times per hour. Another user hosting their own git repos noted a single module being downloaded over 500 times in a single day, generating 4 GiB of traffic.
The Go team holds that this service is not a crawler, and thus they do not obey robots.txt — if they did, I could use it to configure a more reasonable “Crawl-Delay” to control the pace of their crawling efforts. I also suggested keeping the repositories stored on-site and only doing a git fetch, rather than a fresh git clone every time, or using shallow clones. They could also just fetch fresh data when users request it, instead of pro-actively crawling the cache all of the time. All of these suggestions fell on deaf ears, the Go team has not prioritized it, and a year later I am still being DDoSed by Google as a matter of course.
I was banned from the Go issue tracker for mysterious reasons,1 so I cannot continue to nag them for a fix. I can’t blackhole their IP addresses, because that would make all Go modules hosted on git.sr.ht stop working for default Go configurations (i.e. without GOPROXY=direct). I tried to advocate for Linux distros to patch out GOPROXY by default, citing privacy reasons, but I was unsuccessful. I have no further recourse but to tolerate having our little-fish service DoS’d by a 1.38 trillion dollar company. But I will say that if I was in their position, and my service was mistakenly sending an excessive amount of traffic to someone else, I would make it my first priority to fix it. But I suppose no one will get promoted for prioritizing that at Google.
In violation of Go’s own Code of Conduct, by the way, which requires that participants are notified moderator actions against them and given the opportunity to appeal. I happen to be well versed in Go’s CoC given that I was banned once before without notice — a ban which was later overturned on the grounds that the moderator was wrong in the first place. Great community, guys. ↩︎