
go-dhtcrawler

A pure-Go rewrite of btdig/dhtcrawler2 (originally written in Erlang + MongoDB). Joins the BitTorrent Mainline DHT, harvests infohashes from passing queries, downloads .torrent metadata directly from peers via BEP-9, and exposes a search interface over the result.

Zero CGO, single static binary, SQLite + FTS5 for storage and full-text search.

Why a rewrite?

The original is great but has rough edges today:

| Original | This rewrite |
| --- | --- |
| Erlang/OTP runtime to install | Single static Go binary |
| MongoDB server to run + index | Embedded SQLite file |
| Sphinx (optional) for full-text | SQLite FTS5 (built in) |
| Fetched .torrent from torcache.net etc. | Fetches directly from peers via BEP-9 |
| Custom kdht DHT engine | Hand-rolled passive crawler |

The protocol behaviour is the same: be a passive participant in the DHT, respond to incoming get_peers and announce_peer queries with "neighbor" IDs that make us appear close to many infohashes, and harvest the hashes that pass through.

Architecture

                  ┌───────────────┐
   incoming UDP ──▶  DHT Crawler  │
                  │ (passive,     │
                  │  Sybil-style) │
                  └───────┬───────┘
                          │ samples (infohash + peer addr)
                          ▼
                  ┌───────────────┐         ┌──────────────┐
                  │ Sample        │────────▶│   peerHints  │
                  │ Consumer      │         │ (LRU cache)  │
                  │ (batched DB   │         └──────┬───────┘
                  │  writes)      │                │
                  └───────┬───────┘                │
                          │                        │
                          ▼                        │
                  ┌───────────────┐                │
                  │   SQLite +    │◀──────────────┐│
                  │   FTS5        │               ││
                  └───────┬───────┘               ││
                          ▲                       ││
                          │                       ││
                          │  store metadata       ││
                          │                       ││
                  ┌───────────────┐               ││
                  │ Metadata      │◀──────────────┘│
                  │ Workers       │  pending hashes │
                  │ (BEP-9 fetch) │◀────────────────┘
                  └───────┬───────┘  peer hints
                          │
                          ▼ TCP to peers
                  ┌───────────────┐
                  │  HTTP Server  │  search / browse
                  └───────────────┘
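
The "Sample Consumer (batched DB writes)" box can be pictured roughly as follows. This is a minimal sketch, not the code in cmd/dhtcrawler: the Sample type, the consume function, and the flush callback are hypothetical names, and the real consumer may batch on different triggers.

```go
package main

import (
	"fmt"
	"time"
)

// Sample is a hypothetical stand-in for what the crawler emits:
// an infohash plus the address of the peer that mentioned it.
type Sample struct {
	InfoHash [20]byte
	PeerAddr string
}

// consume drains in and groups samples into batches of at most size.
// A partial batch is flushed on every maxWait tick and when in is closed.
// maxWait must be positive.
func consume(in <-chan Sample, size int, maxWait time.Duration, flush func([]Sample)) {
	batch := make([]Sample, 0, size)
	ticker := time.NewTicker(maxWait)
	defer ticker.Stop()

	emit := func() {
		if len(batch) > 0 {
			flush(batch)
			batch = make([]Sample, 0, size)
		}
	}
	for {
		select {
		case s, ok := <-in:
			if !ok {
				emit() // channel closed: flush what is left and stop
				return
			}
			batch = append(batch, s)
			if len(batch) >= size {
				emit()
			}
		case <-ticker.C:
			emit()
		}
	}
}

func main() {
	in := make(chan Sample)
	done := make(chan struct{})
	go func() {
		defer close(done)
		consume(in, 2, time.Minute, func(b []Sample) {
			// A real flush would write the batch in one DB transaction.
			fmt.Printf("flushing %d samples\n", len(b))
		})
	}()
	for i := 0; i < 5; i++ {
		in <- Sample{PeerAddr: fmt.Sprintf("192.0.2.%d:6881", i+1)}
	}
	close(in)
	<-done
}
```

Batching like this trades a little latency for far fewer SQLite transactions, which is usually the limiting factor for write throughput.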

Three subsystems run concurrently inside one process:

  1. The DHT crawler (internal/dht) listens on UDP, responds to queries, and sends find_node to keep itself visible. It pushes observed infohashes onto a sample channel.
  2. The metadata workers (internal/metadata) pull pending hashes from the DB, look up cached peer addresses, dial via TCP, complete the BitTorrent handshake and ut_metadata extension exchange, verify the SHA-1 of the assembled metadata against the infohash, and store the result.
  3. The HTTP server (internal/web) serves search, recent, top, and per-torrent detail pages.
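
The last step of the metadata worker, verifying the assembled metadata against the infohash, can be sketched as below. The 16 KiB piece size comes from BEP-9; the function names (pieceCount, assembleAndVerify) are illustrative assumptions and not the identifiers used in internal/metadata.

```go
package main

import (
	"bytes"
	"crypto/sha1"
	"errors"
	"fmt"
)

// pieceSize is the ut_metadata block size defined by BEP-9 (16 KiB).
const pieceSize = 16 * 1024

var errHashMismatch = errors.New("metadata SHA-1 does not match infohash")

// pieceCount returns how many ut_metadata pieces cover metadataSize bytes.
func pieceCount(metadataSize int) int {
	return (metadataSize + pieceSize - 1) / pieceSize
}

// assembleAndVerify concatenates the pieces in order and checks that the
// SHA-1 of the result equals infoHash. Data from peers is untrusted, so a
// mismatch means the metadata must be discarded.
func assembleAndVerify(pieces [][]byte, infoHash [20]byte) ([]byte, error) {
	meta := bytes.Join(pieces, nil)
	if sha1.Sum(meta) != infoHash {
		return nil, errHashMismatch
	}
	return meta, nil
}

func main() {
	// Toy bencoded "info" dictionary standing in for real metadata.
	info := []byte("d4:name4:demoe")
	infoHash := sha1.Sum(info)

	// Pretend the peer delivered the metadata in two pieces.
	meta, err := assembleAndVerify([][]byte{info[:8], info[8:]}, infoHash)
	fmt.Printf("verified %d bytes, err=%v\n", len(meta), err)

	// A tampered piece fails verification.
	_, err = assembleAndVerify([][]byte{[]byte("garbage")}, infoHash)
	fmt.Println("tampered:", err)

	fmt.Println("pieces needed for 40000 bytes:", pieceCount(40000))
}
```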

Troubleshooting search

If your crawler is collecting torrents but search returns nothing, run the diagnostic tool against your live database:

go run ./cmd/dbtool -db dhtcrawler.db
go run ./cmd/dbtool -db dhtcrawler.db -q "your search term"

It reports:

  • Row counts for torrents, torrents_fts, files, hashes
  • A sample of torrent names with their character class breakdown (ASCII vs CJK vs Cyrillic vs ...)
  • Whether torrents_fts is aligned with torrents
  • Hit counts for common search terms

Project layout

.
├── cmd/dhtcrawler/        # main binary
│   ├── main.go            # orchestration
│   └── peerhints.go       # bounded LRU of infohash -> peer addrs
├── internal/
│   ├── bencode/           # bencode codec (with tests)
│   ├── dht/               # passive DHT crawler + KRPC
│   ├── metadata/          # BEP-9 metadata fetcher
│   ├── store/             # SQLite + FTS5 storage layer
│   ├── web/               # HTTP frontend (templates embedded)
│   │   └── templates/
│   └── config/            # flags + JSON config
├── go.mod
└── README.md
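
For orientation, a bounded LRU of infohash -> peer addrs such as the one peerhints.go is described as holding could look like the sketch below. It is an independent illustration built on container/list; the type and method names are assumptions, and the real file may differ (for example in how many addresses it keeps per hash).

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

type entry struct {
	hash  [20]byte
	addrs []string
}

// peerHints maps infohash -> recently seen peer addresses and evicts the
// least recently used infohash once capacity is exceeded.
type peerHints struct {
	mu       sync.Mutex
	capacity int
	order    *list.List                 // front = most recently used
	items    map[[20]byte]*list.Element // infohash -> element in order
}

func newPeerHints(capacity int) *peerHints {
	return &peerHints{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[[20]byte]*list.Element),
	}
}

// Add records addr as a peer for hash and marks hash as recently used.
// This sketch does not deduplicate or cap the per-hash address list.
func (p *peerHints) Add(hash [20]byte, addr string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if el, ok := p.items[hash]; ok {
		e := el.Value.(*entry)
		e.addrs = append(e.addrs, addr)
		p.order.MoveToFront(el)
		return
	}
	p.items[hash] = p.order.PushFront(&entry{hash: hash, addrs: []string{addr}})
	if p.order.Len() > p.capacity {
		oldest := p.order.Back()
		p.order.Remove(oldest)
		delete(p.items, oldest.Value.(*entry).hash)
	}
}

// Get returns a copy of the known addresses for hash, or nil if unknown.
func (p *peerHints) Get(hash [20]byte) []string {
	p.mu.Lock()
	defer p.mu.Unlock()
	el, ok := p.items[hash]
	if !ok {
		return nil
	}
	p.order.MoveToFront(el)
	return append([]string(nil), el.Value.(*entry).addrs...)
}

func main() {
	hints := newPeerHints(2)
	a, b, c := [20]byte{1}, [20]byte{2}, [20]byte{3}
	hints.Add(a, "192.0.2.1:6881")
	hints.Add(b, "192.0.2.2:6881")
	hints.Get(a)                   // touch a, so b becomes least recently used
	hints.Add(c, "192.0.2.3:6881") // exceeds capacity and evicts b
	fmt.Println(hints.Get(a), hints.Get(b), hints.Get(c))
}
```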

Requirements

  • Go 1.22+ to build. That's the only build-time dependency.
  • An open UDP port (default 6881) reachable from the public internet. NAT'd setups still work, they just observe fewer hashes.
  • A few hundred MB of disk for the SQLite file once you've collected meaningful data.

No database server, no MongoDB, no C compiler, no other runtime.

Build

git clone https://github.com/botsgalaxy/go-dhtcrawler.git
cd go-dhtcrawler
go build -o dhtcrawler ./cmd/dhtcrawler

That produces a single ~16 MB static binary called dhtcrawler.

If you want a smaller, fully-stripped release build:

CGO_ENABLED=0 go build -trimpath -ldflags="-s -w" -o dhtcrawler ./cmd/dhtcrawler

Run

Quick start (defaults are fine for a first run):

./dhtcrawler

Then open http://127.0.0.1:8000/ in your browser.

The first few minutes will look quiet. The crawler needs to (a) bootstrap into the DHT, (b) be queried by other peers, (c) attempt metadata fetches. Realistically, give it 5–30 minutes before judging whether things are working. Watch the periodic stats line in stderr:

dht: rx_q=12450 rx_r=2031 tx_fn=8200 samples=4112 (+842 in 30s) qlen=10000

samples= is the count of infohashes observed; once that's climbing steadily, you're in. The meta: log lines record successful metadata downloads.
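
If you want to watch those counters from a script, the key=value pairs are easy to pull out. The sketch below assumes the stats line always has the shape of the single example above; parseStats is a hypothetical helper, not part of the project.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseStats extracts every key=value pair with an integer value from a
// stats line. Tokens without "=" or with non-numeric values are skipped.
func parseStats(line string) map[string]int64 {
	out := make(map[string]int64)
	for _, field := range strings.Fields(line) {
		key, val, ok := strings.Cut(field, "=")
		if !ok {
			continue
		}
		n, err := strconv.ParseInt(val, 10, 64)
		if err != nil {
			continue
		}
		out[key] = n
	}
	return out
}

func main() {
	line := "dht: rx_q=12450 rx_r=2031 tx_fn=8200 samples=4112 (+842 in 30s) qlen=10000"
	stats := parseStats(line)
	fmt.Println("samples:", stats["samples"], "qlen:", stats["qlen"])
}
```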

Common flags

-db string             SQLite path (default "dhtcrawler.db")
-dht-addr string       UDP listen address (default "0.0.0.0:6881")
-dht-rate int          find_node packets per second (default 200)
-dht-queue int         max queued nodes for find_node (default 10000)
-meta-workers int      concurrent metadata fetchers (default 20)
-meta-tries int        max attempts before marking a hash failed (default 5)
-meta-retry-hours int  retry pending hashes after N hours (default 6)
-meta-timeout int      per-fetch timeout in seconds (default 15)
-http-addr string      HTTP UI bind (default "127.0.0.1:8000")
-torrent-dir string    optional dir to save .torrent files (default off)
-config string         load JSON config from file

JSON config

Equivalent to flags. Useful for systemd / docker:

{
  "dht_listen_addr": "0.0.0.0:6881",
  "dht_find_node_rate": 300,
  "meta_workers": 30,
  "http_listen_addr": "0.0.0.0:8000",
  "db_path": "/var/lib/dhtcrawler/data.db",
  "torrent_dir": "/var/lib/dhtcrawler/torrents"
}

Point the binary at the file:

./dhtcrawler -config /etc/dhtcrawler.json
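
For readers wondering how a file like this maps onto settings, here is a minimal sketch of loading it over defaults. The JSON keys are the ones in the example above and the defaults are taken from the flag list; the Config struct, its field names, and the idea that the file overrides defaults are assumptions about internal/config, which may support more keys than shown.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Config mirrors only the keys shown in the README's JSON example.
type Config struct {
	DHTListenAddr   string `json:"dht_listen_addr"`
	DHTFindNodeRate int    `json:"dht_find_node_rate"`
	MetaWorkers     int    `json:"meta_workers"`
	HTTPListenAddr  string `json:"http_listen_addr"`
	DBPath          string `json:"db_path"`
	TorrentDir      string `json:"torrent_dir"`
}

// defaultConfig returns the defaults listed under "Common flags".
func defaultConfig() Config {
	return Config{
		DHTListenAddr:   "0.0.0.0:6881",
		DHTFindNodeRate: 200,
		MetaWorkers:     20,
		HTTPListenAddr:  "127.0.0.1:8000",
		DBPath:          "dhtcrawler.db",
		TorrentDir:      "", // empty = do not save .torrent files
	}
}

// parseConfig starts from the defaults and overwrites whichever keys are
// present in data; keys absent from the JSON keep their default value.
func parseConfig(data []byte) (Config, error) {
	cfg := defaultConfig()
	if err := json.Unmarshal(data, &cfg); err != nil {
		return Config{}, fmt.Errorf("parse config: %w", err)
	}
	return cfg, nil
}

func main() {
	cfg, err := parseConfig([]byte(`{"dht_find_node_rate": 300, "meta_workers": 30}`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```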

Running on a public server

For a non-trivial collection rate, you really do want a public IP and the UDP port forwarded / unblocked. On Linux you may need to bump UDP buffer sizes:

sudo sysctl -w net.core.rmem_max=4194304
sudo sysctl -w net.core.wmem_max=4194304

If you bind to a port < 1024 you'll need CAP_NET_BIND_SERVICE or root. The default 6881 doesn't have this issue.

As a systemd service

# /etc/systemd/system/dhtcrawler.service
[Unit]
Description=go-dhtcrawler
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/dhtcrawler -config /etc/dhtcrawler.json
Restart=on-failure
User=dhtcrawler
StateDirectory=dhtcrawler
WorkingDirectory=/var/lib/dhtcrawler

[Install]
WantedBy=multi-user.target

Using it

Web UI

| Path | What it shows |
| --- | --- |
| / | Recent torrents |
| /search?q=... | Full-text search across names and file paths |
| /recent | 100 most recently added |
| /top | Most-frequently-announced |
| /stats | Counter dashboard |
| /view/<infohash> | Detail page with full file list and magnet link |

Search uses SQLite FTS5 with the trigram tokenizer. It indexes every 3-character substring, so matching does not depend on word boundaries; that makes it the one built-in tokenizer that works well for non-Latin scripts (Chinese, Japanese, Cyrillic, Arabic, and so on). Each whitespace-separated term of 3+ characters is wrapped in quotes and ANDed, so pink floyd finds rows that match both words anywhere in the name or file paths. Shorter terms fall back to a LIKE scan over names.
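
The term handling described above can be sketched like this. buildQuery is a hypothetical helper, not the function in internal/web, and treating the fallback per term is one reading of the description; the real code may fall back for the whole query instead. Doubling an embedded double quote is the FTS5 string-escaping rule.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// buildQuery splits user input on whitespace. Terms of 3+ characters are
// quoted and joined with AND into an FTS5 MATCH expression; shorter terms
// are returned separately so the caller can handle them with LIKE.
func buildQuery(input string) (match string, short []string) {
	var quoted []string
	for _, term := range strings.Fields(input) {
		// Count runes, not bytes, so CJK terms are measured correctly.
		if utf8.RuneCountInString(term) >= 3 {
			// FTS5 escapes a double quote inside a string by doubling it.
			quoted = append(quoted, `"`+strings.ReplaceAll(term, `"`, `""`)+`"`)
		} else {
			short = append(short, term)
		}
	}
	return strings.Join(quoted, " AND "), short
}

func main() {
	match, short := buildQuery("pink floyd")
	fmt.Println(match, short) // "pink" AND "floyd" []

	match, short = buildQuery("AC dc live")
	fmt.Println(match, short) // "live" [AC dc]
}
```

The match string is meant to be passed as a bound parameter, for example `WHERE torrents_fts MATCH ?`, never spliced into the SQL text.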

Database

The SQLite file is fully usable from outside:

sqlite3 dhtcrawler.db
sqlite> SELECT name, length, announce FROM torrents ORDER BY announce DESC LIMIT 10;
sqlite> SELECT name FROM torrents_fts WHERE torrents_fts MATCH 'ubuntu';

Schema:

| table | purpose |
| --- | --- |
| hashes | every observed infohash; state, req_cnt, retry bookkeeping |
| torrents | resolved torrents (one row per infohash with metadata) |
| files | file list for multi-file torrents |
| torrents_fts | FTS5 virtual table over name + file paths |

Testing

go test ./...

The internal/bencode package has the most thorough tests; the network subsystems are exercised by the smoke run rather than unit tests.

How it actually finds hashes (Sybil-ish design)

The original Erlang code uses the same trick and it's worth explaining:

When another DHT node sends you get_peers <infohash>, you reply as if your node ID were only a bit or a byte away from that infohash, that is, nearly identical to it and therefore very close in XOR distance. Their routing table now has you in the bucket for IDs near that infohash. The next time any node asks them about that hash, you're in the list of "closest" nodes they return.

Multiplied across thousands of peers, a passive crawler ends up receiving an outsized share of get_peers and announce_peer traffic, which is exactly the traffic that reveals active infohashes.

We do this in dht.NeighborID() (see internal/dht/node.go) and apply it in Crawler.handleQuery().
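
To make the trick concrete, here is a minimal sketch of one common way to build such an ID: copy a prefix of the target and fill the rest from your own ID. It is not the project's dht.NeighborID(); the signature, the prefixLen parameter, and the choice of 19 bytes in the example are assumptions for illustration.

```go
package main

import (
	"fmt"
	"math/bits"
)

// NodeID is a 160-bit DHT identifier; infohashes share the same space.
type NodeID [20]byte

// neighborID returns an ID whose first prefixLen bytes come from target and
// whose remaining bytes come from own. The longer the shared prefix, the
// smaller the XOR distance to target.
func neighborID(target, own NodeID, prefixLen int) NodeID {
	if prefixLen < 0 {
		prefixLen = 0
	}
	if prefixLen > len(target) {
		prefixLen = len(target)
	}
	var id NodeID
	copy(id[:prefixLen], target[:prefixLen])
	copy(id[prefixLen:], own[prefixLen:])
	return id
}

// sharedPrefixBits counts the leading bits two IDs have in common, which is
// how Kademlia-style routing decides which bucket a node falls into.
func sharedPrefixBits(a, b NodeID) int {
	for i := range a {
		if x := a[i] ^ b[i]; x != 0 {
			return i*8 + bits.LeadingZeros8(x)
		}
	}
	return len(a) * 8
}

func main() {
	var infohash, own NodeID
	for i := range infohash {
		infohash[i] = 0xAA // stand-in for an infohash seen in get_peers
		own[i] = 0x55      // stand-in for our real node ID
	}
	fake := neighborID(infohash, own, 19)
	fmt.Printf("infohash: %x\n", infohash[:])
	fmt.Printf("reply id: %x\n", fake[:])
	fmt.Println("shared prefix bits:", sharedPrefixBits(fake, infohash))
}
```

Keeping the tail bytes from the real ID means replies for different infohashes still differ from the infohash itself while sharing most of its prefix.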

What this rewrite leaves out

The original repo carried a few side projects I didn't port:

  • MongoDB replica set tooling. SQLite is single-node, so it doesn't apply.
  • Sphinx integration. FTS5 covers full-text search on its own.
  • tor_builder / loc_torrent_cache. These were caching layers for torcache-fetched files. We fetch from peers directly, so they're moot.
  • The torcache.net / torrange.com / btbox.n0808.com fetch fallbacks. Those services are dead.

If you want any of this back, the code is small and modular enough that adding it is a hundred-line job rather than a refactor.

Legal

A DHT crawler observes traffic that anyone running a DHT client can observe. Storing what you observe is generally fine. Distributing copyright-infringing content is not. This tool indexes magnet links. Anyone using those links to download content is responsible for their own choices, same as any other search engine. Run it where it's legal to run it.

License

BSD-2-Clause. See LICENSE. The original dhtcrawler2 is also BSD-licensed. Credit to Kevin Lynx for the original design.
