notneet/nime-crawler

Disclaimer

This project crawls third-party websites. The legality of web crawling varies by jurisdiction and site terms of service. I take no responsibility for how this software is used. Use at your own risk.

Description

Multi-site anime crawler built as a NestJS monorepo. A cron seeder publishes crawl jobs to a RabbitMQ topic exchange; a worker parses pages (via XPath or a headless browser) and recursively re-publishes next-stage jobs; a sink publishes the final structured data to a public results exchange. Each site is described by a declarative SiteAdapter config object (otakudesu is the reference). All pipeline apps are headless RMQ microservices with no HTTP interface.
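To make the staged flow concrete, here is a minimal TypeScript sketch of how a declarative adapter could drive recursive re-publishing. Everything in it is a hypothetical illustration: the `SiteAdapterSketch` shape, the stage names, the XPath strings, the routing keys, and the assumption that the last stage is handed to the parsed exchange. The project's actual `SiteAdapter` interface may look quite different.

```typescript
// Hypothetical shapes for illustration only; not the project's real SiteAdapter.
type Stage = { xpath: string; next: string | null };

interface SiteAdapterSketch {
  site: string;
  entry: string; // stage the seeder starts with
  stages: Record<string, Stage>;
}

interface CrawlJob {
  site: string;
  stage: string;
  url: string;
}

type Outgoing =
  | { exchange: "crawl"; routingKey: string; job: CrawlJob }
  | { exchange: "parsed"; routingKey: string; url: string };

// Assumed example config: index -> detail -> episode.
const exampleAdapter: SiteAdapterSketch = {
  site: "example",
  entry: "index",
  stages: {
    index: { xpath: "//a[@class='series']/@href", next: "detail" },
    detail: { xpath: "//a[@class='episode']/@href", next: "episode" },
    episode: { xpath: "//a[@class='mirror']/@href", next: null },
  },
};

// Given the links extracted from one page, decide what the worker publishes next:
// another crawl job for the following stage, or a parsed result if this stage is last.
function planNext(
  adapter: SiteAdapterSketch,
  job: CrawlJob,
  links: string[],
): Outgoing[] {
  const stage = adapter.stages[job.stage];
  if (!stage) throw new Error(`unknown stage: ${job.stage}`);
  const next = stage.next;
  return links.map((url): Outgoing =>
    next === null
      ? { exchange: "parsed", routingKey: `${adapter.site}.${job.stage}`, url }
      : {
          exchange: "crawl",
          routingKey: `${adapter.site}.${next}`,
          job: { site: adapter.site, stage: next, url },
        },
  );
}
```

Keeping the stage graph in data rather than code is what would let a new site be added as configuration only; the worker logic stays the same for every adapter.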

Running the crawler

Start RabbitMQ, copy .env.example to .env, then run each app.

Startup order matters. The crawl/parsed exchanges are RabbitMQ topic exchanges, which silently drop messages that have no bound queue. Start the consumers first so their durable queues are declared and bound before any producer publishes:

pnpm run start:worker     # 1. scraper-worker — binds anime.crawl.worker (scale this one)
pnpm run start:sink       # 2. result-sink    — binds anime.parsed.sink
pnpm run start:store      # 3. result-store   — binds anime.results.store, writes SQLite
pnpm run start:control    # 4. control        — control plane (crawl.trigger)
pnpm run start:downloader # 5. downloader     — binds anime.download.archive, archives videos to S3/MinIO
pnpm run start:scheduler  # 6. scheduler      — cron seeder, starts publishing jobs

The downloader needs an S3-compatible target (MinIO) — set the S3_* vars in .env and pre-create the bucket (mc mb local/<bucket>). See downloader.

Bring the scheduler up last; once it runs (or you emit crawl.trigger), index jobs flow into the already-bound worker queue. Queues are durable, so a consumer that restarts later still drains anything published while it was down — only messages published before a queue ever existed are lost.
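The startup-order rule can be illustrated with a small in-memory model. This is a sketch of the routing behaviour only, not a RabbitMQ client: `TopicExchangeModel`, the `#` binding pattern, and the `otakudesu.index` routing key are assumptions made for the example, while `anime.crawl.worker` is the queue name from the list above.

```typescript
// In-memory model of a topic exchange with queues that outlive their consumer.
class TopicExchangeModel {
  private bindings: { pattern: string; queue: string }[] = [];
  private queues = new Map<string, string[]>();

  // Declaring and binding a queue is what a consumer does on startup.
  bind(queue: string, pattern: string): void {
    if (!this.queues.has(queue)) this.queues.set(queue, []);
    this.bindings.push({ pattern, queue });
  }

  // Returns how many queues received the message; 0 means it was dropped.
  publish(routingKey: string, body: string): number {
    const targets = new Set(
      this.bindings
        .filter((b) => matches(b.pattern, routingKey))
        .map((b) => b.queue),
    );
    targets.forEach((q) => this.queues.get(q)!.push(body));
    return targets.size;
  }

  // A (re)connected consumer drains whatever accumulated while it was away.
  drain(queue: string): string[] {
    return this.queues.get(queue)?.splice(0) ?? [];
  }
}

// Topic matching: "*" is exactly one dot-separated word, "#" is zero or more.
function matches(pattern: string, key: string): boolean {
  const p = pattern.split(".");
  const k = key.split(".");
  const go = (i: number, j: number): boolean => {
    if (i === p.length) return j === k.length;
    if (p[i] === "#") return go(i + 1, j) || (j < k.length && go(i, j + 1));
    return j < k.length && (p[i] === "*" || p[i] === k[j]) && go(i + 1, j + 1);
  };
  return go(0, 0);
}

const exchange = new TopicExchangeModel();

// Producer publishes before any consumer has ever started: nothing is bound.
const early = exchange.publish("otakudesu.index", "job-0"); // 0, dropped

// Worker starts and binds its queue, then goes offline again.
exchange.bind("anime.crawl.worker", "#");
const later = exchange.publish("otakudesu.index", "job-1"); // 1, queued

// Worker comes back and finds only the message published after the bind.
const drained = exchange.drain("anime.crawl.worker"); // ["job-1"]
```

The model leaves out acknowledgements, persistence across broker restarts, and alternate exchanges; it only shows why a message published before the first bind has nowhere to go.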

The HTTP dashboard is independent of the pipeline and can be started at any time:

DASH_USER=admin DASH_PASS=admin pnpm run start:dashboard  # HTTP UI on :3001

Architecture & per-app docs

See docs/infra/ for the full breakdown.

Project setup

$ pnpm install

Compile and run the project

# development
$ pnpm run start

# watch mode
$ pnpm run start:dev

# production mode
$ pnpm run start:prod

Run tests

# unit tests
$ pnpm run test

# e2e tests
$ pnpm run test:e2e

# test coverage
$ pnpm run test:cov


About

(Under development) This is not a bot for crawling anime websites (development is on hold because of my main job).
