GitHub - notneet/nime-crawler: (Under Dev) This is not a bot for crawling anime web (development is held up due to main work)

Disclaimer

This project crawls third-party websites. The legality of web crawling varies by jurisdiction and site terms of service. I take no responsibility for how this software is used. Use at your own risk.

Description

Multi-site anime crawler built as a NestJS monorepo. A cron seeder publishes crawl jobs to a RabbitMQ topic exchange; a worker parses pages (with XPath or a headless browser) and recursively re-publishes next-stage jobs; a sink publishes the final structured data to a public results exchange. Each site is described by a declarative SiteAdapter config object (otakudesu is the reference). The pipeline apps are headless RMQ microservices with no HTTP interface; only the dashboard serves HTTP.
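To make the declarative idea concrete, here is a minimal sketch of how a site adapter could map each crawl stage to a selector and name the stage that follows it. Everything in it is an assumption for illustration: the field names (`stages`, `linkXPath`, `next`, `render`), the stage names, the placeholder URL, and the `nextJobs` helper are hypothetical and not taken from the repository's actual SiteAdapter type.

```typescript
// Hypothetical shapes: none of these names come from the repository.
type Render = "xpath" | "browser";

interface StageConfig {
  render: Render;      // parse with XPath or with a headless browser
  linkXPath: string;   // where to find the links that feed the next stage
  next: string | null; // stage to publish next; null marks the final stage
}

interface SiteAdapter {
  site: string;
  baseUrl: string;
  entryStage: string;
  stages: Record<string, StageConfig>;
}

interface CrawlJob {
  site: string;
  stage: string;
  url: string;
}

// Example adapter; the URL and XPath expressions are placeholders.
const exampleAdapter: SiteAdapter = {
  site: "example",
  baseUrl: "https://example.invalid",
  entryStage: "index",
  stages: {
    index: { render: "xpath", linkXPath: "//a[@class='series']/@href", next: "detail" },
    detail: { render: "xpath", linkXPath: "//a[@class='episode']/@href", next: "episode" },
    episode: { render: "browser", linkXPath: "//video/@src", next: null },
  },
};

// Given a job and the links extracted from its page, build the next-stage
// jobs to re-publish. A final stage yields no jobs; its parsed data would be
// handed to the sink instead.
function nextJobs(adapter: SiteAdapter, job: CrawlJob, links: string[]): CrawlJob[] {
  const stage = adapter.stages[job.stage];
  if (!stage) throw new Error(`unknown stage: ${job.stage}`);
  if (stage.next === null) return [];
  const next = stage.next;
  return links.map((href) => ({
    site: adapter.site,
    stage: next,
    url: new URL(href, adapter.baseUrl).toString(), // resolve relative links
  }));
}
```

With a layout like this, adding a site means writing one more config object rather than new worker code, which is the point of keeping adapters declarative.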

Running the crawler

Start RabbitMQ, copy .env.example to .env, then run each app.

Startup order matters. The crawl/parsed exchanges are RabbitMQ topic exchanges, which silently drop any message whose routing key matches no bound queue. Start the consumers first so their durable queues are declared and bound before any producer publishes:

pnpm run start:worker     # 1. scraper-worker — binds anime.crawl.worker (scale this one)
pnpm run start:sink       # 2. result-sink    — binds anime.parsed.sink
pnpm run start:store      # 3. result-store   — binds anime.results.store, writes SQLite
pnpm run start:control    # 4. control        — control plane (crawl.trigger)
pnpm run start:downloader # 5. downloader     — binds anime.download.archive, archives videos to S3/MinIO
pnpm run start:scheduler  # 6. scheduler      — cron seeder, starts publishing jobs

The downloader needs an S3-compatible target such as MinIO: set the S3_* vars in .env and pre-create the bucket (mc mb local/<bucket>). See the downloader doc in docs/infra/.

Bring the scheduler up last; once it runs (or you emit crawl.trigger), index jobs flow into the already-bound worker queue. Queues are durable, so a consumer that restarts later still drains anything published while it was down — only messages published before a queue ever existed are lost.
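The reason for this ordering can be shown with a small in-memory model of a topic exchange. This is a sketch for illustration only: it is not the RabbitMQ client API, and the routing key `crawl.example.index` and binding pattern `crawl.#` are made-up values, not the ones the project uses.

```typescript
// Topic matching as RabbitMQ defines it: words are separated by dots,
// '*' matches exactly one word, '#' matches zero or more words.
function topicMatches(pattern: string, key: string): boolean {
  const p = pattern.split(".");
  const k = key.split(".");
  const match = (i: number, j: number): boolean => {
    if (i === p.length) return j === k.length;
    if (p[i] === "#") {
      // Try letting '#' consume 0, 1, 2, ... of the remaining words.
      for (let n = j; n <= k.length; n++) {
        if (match(i + 1, n)) return true;
      }
      return false;
    }
    if (j === k.length) return false;
    return (p[i] === "*" || p[i] === k[j]) && match(i + 1, j + 1);
  };
  return match(0, 0);
}

// Minimal stand-in for an exchange: it only knows its current bindings.
class TopicExchange {
  private bindings: { pattern: string; queue: string[] }[] = [];

  // Declare a queue and bind it; returns the queue's message buffer.
  bind(pattern: string): string[] {
    const queue: string[] = [];
    this.bindings.push({ pattern, queue });
    return queue;
  }

  // Returns how many queues received the message; 0 means it was dropped.
  publish(routingKey: string, body: string): number {
    let delivered = 0;
    for (const b of this.bindings) {
      if (topicMatches(b.pattern, routingKey)) {
        b.queue.push(body);
        delivered++;
      }
    }
    return delivered;
  }
}

const crawl = new TopicExchange();
// Producer publishes before any consumer has bound a queue: nothing keeps it.
const early = crawl.publish("crawl.example.index", "job-1");
// Consumer starts and binds its queue.
const workerQueue = crawl.bind("crawl.#");
// From here on the message waits in the queue until a consumer drains it.
const late = crawl.publish("crawl.example.index", "job-2");
```

In this model `early` is 0 and `job-1` is gone, while `job-2` sits in `workerQueue`. The exchange itself stores nothing, which is why the queue has to exist and be bound before the first publish.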

The HTTP dashboard is independent of the pipeline and can run at any time:

DASH_USER=admin DASH_PASS=admin pnpm run start:dashboard  # HTTP UI on :3001

Architecture & per-app docs

See docs/infra/ for the full breakdown:

architecture — overview, exchanges, where results are stored
scheduler — cron seeder
scraper-worker — parse worker
result-sink — results publisher
result-store — SQLite persistence
control — on-demand trigger
downloader — archives episode videos to S3/MinIO
dashboard — HTTP UI to browse/edit/re-crawl stored data

Project setup

$ pnpm install

Compile and run the project

# development
$ pnpm run start

# watch mode
$ pnpm run start:dev

# production mode
$ pnpm run start:prod

Run tests

# unit tests
$ pnpm run test

# e2e tests
$ pnpm run test:e2e

# test coverage
$ pnpm run test:cov

Preview
