A pure-Go rewrite of btdig/dhtcrawler2
(originally written in Erlang + MongoDB). Joins the BitTorrent Mainline
DHT, harvests infohashes from passing queries, downloads .torrent
metadata directly from peers via BEP-9, and exposes a search interface
over the result.
Zero CGO, single static binary, SQLite + FTS5 for storage and full-text search.
The original is great but has rough edges today:
| Original | This rewrite |
|---|---|
| Erlang/OTP runtime to install | Single static Go binary |
| MongoDB server to run + index | Embedded SQLite file |
| Sphinx (optional) for full-text | SQLite FTS5 (built in) |
| Fetched .torrent from torcache.net etc. | Fetches directly from peers via BEP-9 |
| Custom kdht DHT engine | Hand-rolled passive crawler |
The protocol behaviour is the same: be a passive participant in the DHT,
respond to incoming get_peers and announce_peer queries with
"neighbor" IDs that make us appear close to many infohashes, and harvest
the hashes that pass through.
                 ┌───────────────┐
incoming UDP ──▶ │ DHT Crawler   │
                 │ (passive,     │
                 │ Sybil-style)  │
                 └───────┬───────┘
                         │ samples (infohash + peer addr)
                         ▼
                 ┌───────────────┐         ┌──────────────┐
                 │ Sample        │────────▶│ peerHints    │
                 │ Consumer      │         │ (LRU cache)  │
                 │ (batched DB   │         └──────┬───────┘
                 │  writes)      │                │
                 └───────┬───────┘                │
                         │                        │
                         ▼                        │
                 ┌───────────────┐                │
                 │ SQLite +      │◀──────────────┐│
                 │ FTS5          │               ││
                 └───────┬───────┘               ││
                         ▲                       ││
                         │                       ││
                         │ store metadata        ││
                         │                       ││
                 ┌───────────────┐               ││
                 │ Metadata      │◀──────────────┘│
                 │ Workers       │ pending hashes │
                 │ (BEP-9 fetch) │◀───────────────┘
                 └───────┬───────┘   peer hints
                         │
                         ▼ TCP to peers
                 ┌───────────────┐
                 │ HTTP Server   │ search / browse
                 └───────────────┘
Three subsystems run concurrently inside one process:
- The DHT crawler (internal/dht) listens on UDP, responds to queries, and sends find_node to keep itself visible. It pushes observed infohashes onto a sample channel.
- The metadata workers (internal/metadata) pull pending hashes from the DB, look up cached peer addresses, dial via TCP, complete the BitTorrent handshake and ut_metadata extension exchange, verify the SHA-1 of the assembled metadata against the infohash, and store the result.
- The HTTP server (internal/web) serves search, recent, top, and per-torrent detail pages.
If your crawler is collecting torrents but search returns nothing, run the diagnostic tool against your live database:
go run ./cmd/dbtool -db dhtcrawler.db
go run ./cmd/dbtool -db dhtcrawler.db -q "your search term"
It reports:
- Row counts for torrents, torrents_fts, files, hashes
- A sample of torrent names with their character class breakdown (ASCII vs CJK vs Cyrillic vs ...)
- Whether torrents_fts is aligned with torrents
- Hit counts for common search terms
.
├── cmd/dhtcrawler/ # main binary
│ ├── main.go # orchestration
│ └── peerhints.go # bounded LRU of infohash -> peer addrs
├── internal/
│ ├── bencode/ # bencode codec (with tests)
│ ├── dht/ # passive DHT crawler + KRPC
│ ├── metadata/ # BEP-9 metadata fetcher
│ ├── store/ # SQLite + FTS5 storage layer
│ ├── web/ # HTTP frontend (templates embedded)
│ │ └── templates/
│ └── config/ # flags + JSON config
├── go.mod
└── README.md
- Go 1.22+ to build. That's the only build-time dependency.
- An open UDP port (default 6881) reachable from the public internet. NAT'd setups still work, they just observe fewer hashes.
- A few hundred MB of disk for the SQLite file once you've collected meaningful data.
No database server, no MongoDB, no C compiler, no other runtime.
git clone https://github.com/botsgalaxy/go-dhtcrawler.git
cd go-dhtcrawler
go build -o dhtcrawler ./cmd/dhtcrawler
That produces a single ~16 MB static binary called dhtcrawler.
If you want a smaller, fully-stripped release build:
CGO_ENABLED=0 go build -trimpath -ldflags="-s -w" -o dhtcrawler ./cmd/dhtcrawler
Quick start (defaults are fine for a first run):
./dhtcrawler
Then open http://127.0.0.1:8000/ in your browser.
The first few minutes will look quiet. The crawler needs to (a) bootstrap into the DHT, (b) be queried by other peers, (c) attempt metadata fetches. Realistically, give it 5–30 minutes before judging whether things are working. Watch the periodic stats line in stderr:
dht: rx_q=12450 rx_r=2031 tx_fn=8200 samples=4112 (+842 in 30s) qlen=10000
samples= is the count of infohashes observed; once that's climbing
steadily, you're in. The meta: log lines record successful metadata
downloads.
-db string SQLite path (default "dhtcrawler.db")
-dht-addr string UDP listen address (default "0.0.0.0:6881")
-dht-rate int find_node packets per second (default 200)
-dht-queue int max queued nodes for find_node (default 10000)
-meta-workers int concurrent metadata fetchers (default 20)
-meta-tries int max attempts before marking a hash failed (default 5)
-meta-retry-hours int retry pending hashes after N hours (default 6)
-meta-timeout int per-fetch timeout in seconds (default 15)
-http-addr string HTTP UI bind (default "127.0.0.1:8000")
-torrent-dir string optional dir to save .torrent files (default off)
-config string load JSON config from file
A JSON config file is equivalent to the flags and is useful for systemd or Docker:
{
"dht_listen_addr": "0.0.0.0:6881",
"dht_find_node_rate": 300,
"meta_workers": 30,
"http_listen_addr": "0.0.0.0:8000",
"db_path": "/var/lib/dhtcrawler/data.db",
"torrent_dir": "/var/lib/dhtcrawler/torrents"
}
./dhtcrawler -config /etc/dhtcrawler.json
For a non-trivial collection rate, you really do want a public IP and the UDP port forwarded / unblocked. On Linux you may need to bump UDP buffer sizes:
sudo sysctl -w net.core.rmem_max=4194304
sudo sysctl -w net.core.wmem_max=4194304
If you bind to a port < 1024 you'll need CAP_NET_BIND_SERVICE or root.
The default 6881 doesn't have this issue.
# /etc/systemd/system/dhtcrawler.service
[Unit]
Description=go-dhtcrawler
After=network-online.target
Wants=network-online.target
[Service]
ExecStart=/usr/local/bin/dhtcrawler -config /etc/dhtcrawler.json
Restart=on-failure
User=dhtcrawler
StateDirectory=dhtcrawler
WorkingDirectory=/var/lib/dhtcrawler
[Install]
WantedBy=multi-user.target

| Path | What it shows |
|---|---|
| / | Recent torrents |
| /search?q=... | Full-text search across names and file paths |
| /recent | 100 most recently added |
| /top | Most-frequently-announced |
| /stats | Counter dashboard |
| /view/<infohash> | Detail page with full file list and magnet link |
Search uses SQLite FTS5 with the trigram tokenizer. Every 3-character
substring is indexed, so substring matches work even in scripts that do
not separate words with spaces (Chinese, Japanese), which the word-based
built-in tokenizers handle poorly, as well as in Cyrillic, Arabic, and so
on. Each whitespace-separated term of 3+ characters is wrapped in quotes
and ANDed, so pink floyd finds rows that match both words anywhere in the
name or file paths. Shorter terms fall back to a LIKE scan over names.
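The query construction described above can be sketched as follows. This is a minimal illustration, not the code in internal/web: the function name buildMatch is hypothetical, and it assumes that terms shorter than three characters are simply left out of the MATCH expression, with the caller falling back to a LIKE scan when no term qualifies.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// buildMatch turns raw user input into an FTS5 MATCH expression.
// Each whitespace-separated term of 3+ characters is quoted and the
// terms are joined with AND. ok is false when no term is long enough
// for the trigram index, signalling that a LIKE scan should be used.
func buildMatch(input string) (match string, ok bool) {
	var terms []string
	for _, t := range strings.Fields(input) {
		// Count characters, not bytes, so CJK terms are measured correctly.
		if utf8.RuneCountInString(t) < 3 {
			continue
		}
		// Inside an FTS5 string, a literal double quote is written twice.
		t = strings.ReplaceAll(t, `"`, `""`)
		terms = append(terms, `"`+t+`"`)
	}
	if len(terms) == 0 {
		return "", false
	}
	return strings.Join(terms, " AND "), true
}

func main() {
	for _, q := range []string{"pink floyd", "ab", "进击的巨人 s01"} {
		m, ok := buildMatch(q)
		fmt.Printf("%q -> %q (fts=%v)\n", q, m, ok)
	}
}
```

Quoting each term keeps characters such as hyphens or colons in torrent names from being parsed as FTS5 query syntax.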
The SQLite file is fully usable from outside:
sqlite3 dhtcrawler.db
sqlite> SELECT name, length, announce FROM torrents ORDER BY announce DESC LIMIT 10;
sqlite> SELECT name FROM torrents_fts WHERE torrents_fts MATCH 'ubuntu';
Schema:
| table | purpose |
|---|---|
| hashes | every observed infohash; state, req_cnt, retry bookkeeping |
| torrents | resolved torrents (one row per infohash with metadata) |
| files | file list for multi-file torrents |
| torrents_fts | FTS5 virtual table over name + file paths |
go test ./...
The internal/bencode package has the most thorough tests; the network
subsystems are exercised by the smoke run rather than unit tests.
The original Erlang code uses the same trick and it's worth explaining:
When another DHT node sends you get_peers <infohash>, you reply as if
your node ID were only a bit or a byte of XOR distance away from that infohash. Their
routing table now has you in the bucket for hashes near that infohash.
Next time any node asks them for peers near that hash, you're in the
list of "closest" nodes they return.
Multiplied across thousands of peers, a passive crawler ends up
receiving an outsized share of get_peers and announce_peer traffic,
which is exactly the traffic that reveals active infohashes.
We do this in dht.NeighborID() (see internal/dht/node.go) and apply
it in Crawler.handleQuery().
The original repo carried a few side projects I didn't port:
- MongoDB replica set tooling. SQLite is single-node, so it doesn't apply.
- Sphinx integration. FTS5 covers full-text search on its own.
- tor_builder/loc_torrent_cache. These were caching layers for torcache-fetched files. We fetch from peers directly, so they're moot.
- The torcache.net / torrange.com / btbox.n0808.com fetch fallbacks. Those services are dead.
If you want any of this back, the code is small and modular enough that adding it is a hundred-line job rather than a refactor.
A DHT crawler observes traffic that anyone running a DHT client can observe. Storing what you observe is generally fine. Distributing copyright-infringing content is not. This tool indexes magnet links. Anyone using those links to download content is responsible for their own choices, same as any other search engine. Run it where it's legal to run it.
BSD-2-Clause. See LICENSE. The original dhtcrawler2 is
also BSD-licensed. Credit to Kevin Lynx for the original design.
- GitHub: botsgalaxy
- Telegram: @primeakash