This project crawls third-party websites. The legality of web crawling varies by jurisdiction and site terms of service. I take no responsibility for how this software is used. Use at your own risk.
Multi-site anime crawler on a NestJS monorepo. A cron seeder publishes crawl jobs
to a RabbitMQ topic exchange; a worker parses pages (XPath or headless browser) and
recursively re-publishes next-stage jobs; a sink publishes final structured data to
a public results exchange. Sites are declarative SiteAdapter config objects
(otakudesu is the reference). All pipeline apps are headless RMQ microservices with no HTTP server; only the optional dashboard serves HTTP.
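
To make the recursive stage flow concrete, here is a minimal sketch of how a declarative adapter could drive it. The real SiteAdapter interface is not shown in this README, so every name below (`StageConfig`, `linkXPath`, `advance`, the stage names, the example.com URLs) is a hypothetical stand-in, not the project's actual API.

```typescript
// Hypothetical shapes: assumptions for illustration, not the project's real types.
type ParseMode = "xpath" | "browser";

interface StageConfig {
  mode: ParseMode;     // parse with XPath or with a headless browser
  linkXPath?: string;  // where links to the next stage are found (assumed field)
  next?: string;       // name of the next stage; absent means this stage is terminal
}

interface SiteAdapter {
  site: string;
  entryStage: string;
  stages: Record<string, StageConfig>;
}

interface CrawlJob {
  site: string;
  stage: string;
  url: string;
}

// Either more crawl jobs to re-publish, or a finished page handed to the sink.
type Outcome =
  | { kind: "crawl"; jobs: CrawlJob[] }
  | { kind: "parsed"; job: CrawlJob };

// Decide what the worker does after parsing one page:
// non-terminal stages fan out into next-stage jobs, terminal stages go to the sink.
function advance(adapter: SiteAdapter, job: CrawlJob, links: string[]): Outcome {
  const stage = adapter.stages[job.stage];
  if (!stage) throw new Error(`unknown stage: ${job.stage}`);
  if (stage.next === undefined) return { kind: "parsed", job };
  const next = stage.next;
  return {
    kind: "crawl",
    jobs: links.map((url) => ({ site: adapter.site, stage: next, url })),
  };
}

// Example adapter with made-up stages and selectors.
const exampleAdapter: SiteAdapter = {
  site: "otakudesu",
  entryStage: "index",
  stages: {
    index: { mode: "xpath", linkXPath: "//a[@class='anime']/@href", next: "anime" },
    anime: { mode: "xpath", linkXPath: "//a[@class='episode']/@href", next: "episode" },
    episode: { mode: "browser" }, // terminal: result goes to the sink
  },
};

const firstStep = advance(
  exampleAdapter,
  { site: "otakudesu", stage: "index", url: "https://example.com/list" },
  ["https://example.com/anime/1", "https://example.com/anime/2"],
);
console.log(firstStep.kind);
```

Keeping the stage graph in data means that adding a site is mostly a matter of writing a new config object, which is presumably the motivation for the declarative design.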
Start RabbitMQ, copy .env.example to .env, then run each app.
Startup order matters. The crawl/parsed exchanges are RabbitMQ topic exchanges, and an exchange silently drops any message whose routing key matches no bound queue. Start the consumers first so their durable queues are declared and bound before any producer publishes:
pnpm run start:worker # 1. scraper-worker — binds anime.crawl.worker (scale this one)
pnpm run start:sink # 2. result-sink — binds anime.parsed.sink
pnpm run start:store # 3. result-store — binds anime.results.store, writes SQLite
pnpm run start:control # 4. control — control plane (crawl.trigger)
pnpm run start:downloader # 5. downloader — binds anime.download.archive, archives videos to S3/MinIO
pnpm run start:scheduler # 6. scheduler — cron seeder, starts publishing jobs

The downloader needs an S3-compatible target (MinIO): set the S3_* vars in
.env and pre-create the bucket (mc mb local/<bucket>). See
downloader.
Bring the scheduler up last; once it runs (or you emit crawl.trigger), index
jobs flow into the already-bound worker queue. Queues are durable, so a
consumer that restarts later still drains anything published while it was down —
only messages published before a queue ever existed are lost.
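
The drop-versus-retain behaviour can be illustrated with a small in-memory model of a topic exchange. This is a simplified sketch, not RabbitMQ itself: it only implements standard topic matching (`*` for exactly one word, `#` for zero or more words) and per-queue buffering. The routing key `crawl.otakudesu.index` and the binding pattern `crawl.#` are assumptions made for the example; the project's real keys are not listed here.

```typescript
// Standard AMQP topic matching: words are dot-separated,
// '*' matches exactly one word and '#' matches zero or more words.
function topicMatches(pattern: string, key: string): boolean {
  const p = pattern.split(".");
  const k = key === "" ? [] : key.split(".");
  const go = (i: number, j: number): boolean => {
    if (i === p.length) return j === k.length;
    if (p[i] === "#") {
      // Try consuming 0..n remaining words with '#'.
      for (let n = j; n <= k.length; n++) {
        if (go(i + 1, n)) return true;
      }
      return false;
    }
    if (j === k.length) return false;
    return (p[i] === "*" || p[i] === k[j]) && go(i + 1, j + 1);
  };
  return go(0, 0);
}

// Toy exchange: a bound queue buffers messages even when nothing consumes them.
class TopicExchange {
  private queues = new Map<string, { pattern: string; messages: string[] }>();

  bindQueue(queueName: string, pattern: string): void {
    if (!this.queues.has(queueName)) {
      this.queues.set(queueName, { pattern, messages: [] });
    }
  }

  // Returns how many queues received the message; 0 means it was dropped.
  publish(routingKey: string, body: string): number {
    let routed = 0;
    this.queues.forEach((q) => {
      if (topicMatches(q.pattern, routingKey)) {
        q.messages.push(body);
        routed++;
      }
    });
    return routed;
  }

  // A consumer (re)connecting and taking everything that accumulated.
  drain(queueName: string): string[] {
    const q = this.queues.get(queueName);
    if (!q) return [];
    const out = q.messages;
    q.messages = [];
    return out;
  }
}

const exchange = new TopicExchange();
// Published before any queue exists: routed to nothing, so it is lost.
const lostCount = exchange.publish("crawl.otakudesu.index", "job-1");
// The worker starts and binds its queue (pattern is an assumption).
exchange.bindQueue("anime.crawl.worker", "crawl.#");
// Published while the consumer is idle: buffered in the bound queue.
const keptCount = exchange.publish("crawl.otakudesu.index", "job-2");
console.log(lostCount, keptCount, exchange.drain("anime.crawl.worker"));
```

In the model, `job-1` never reaches a queue while `job-2` waits until drained, which mirrors why the scheduler is started last.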
The HTTP dashboard is independent of the pipeline and can run any time:
DASH_USER=admin DASH_PASS=admin pnpm run start:dashboard # HTTP UI on :3001

See docs/infra/ for the full breakdown:
- architecture — overview, exchanges, where results are stored
- scheduler — cron seeder
- scraper-worker — parse worker
- result-sink — results publisher
- result-store — SQLite persistence
- control — on-demand trigger
- downloader — archives episode videos to S3/MinIO
- dashboard — HTTP UI to browse/edit/re-crawl stored data
$ pnpm install

# development
$ pnpm run start
# watch mode
$ pnpm run start:dev
# production mode
$ pnpm run start:prod

# unit tests
$ pnpm run test
# e2e tests
$ pnpm run test:e2e
# test coverage
$ pnpm run test:cov