A set of tools for controlling processing workflows with spiders and scripts running in Scrapinghub Scrapy Cloud.
pip install shub-workflow
If you want to support s3 tools:
pip install shub-workflow[with-s3-tools]
For google cloud storage tools support:
pip install shub-workflow[with-gcs-tools]
Check the Project Wiki for documentation. The code tests also contain many usage examples.
shub-workflow ships a Claude Code plugin, shub-workflow-toolkit, that gives Claude working knowledge of shub-workflow tooling. It currently bundles three skills:
- scanjobs-programs: authoring and running the scanjobs job-scanning + plotting tool and its command-line "programs".
- shub-workflow-scripts: writing or fixing scripts built on the shub_workflow.script base classes (BaseScript/BaseLoopScript/BaseLoopScriptAsyncMixin), i.e. any script that runs on or operates on Scrapy Cloud.
- shub-workflow-crawl-managers: building, updating or understanding crawl managers (CrawlManager/PeriodicCrawlManager/GeneratorCrawlManager/AsyncSchedulerCrawlManagerMixin), covering the set_parameters_gen() pattern, outcome/retry hooks, and async scheduling.
Install it from this repository's plugin marketplace, from inside Claude Code:
/plugin marketplace add scrapinghub/shub-workflow
/plugin install shub-workflow-toolkit@shub-workflow
To enable it automatically for a project, add it to that project's .claude/settings.json:
{
  "enabledPlugins": ["shub-workflow-toolkit@shub-workflow"]
}
The plugin is unversioned (its plugin.json has no version field), so each commit pushed to
this repository is a new version. When Claude Code installs the plugin it copies it into a local
cache (~/.claude/plugins/cache/) and uses that copy; it does not read your working tree or
re-pull from GitHub on every session. You choose how new commits reach you:
- Automatic. Turn on auto-update for this marketplace: run /plugin, open the Marketplaces tab, and enable auto-update for shub-workflow (or set it in settings; see below). With this on, Claude Code re-pulls the marketplace from GitHub and updates installed plugins at startup, so a new session always loads the latest pushed commit. This is the low-friction option for staying current.
  {
    "extraKnownMarketplaces": {
      "shub-workflow": {
        "source": { "source": "github", "repo": "scrapinghub/shub-workflow" },
        "autoUpdate": true
      }
    },
    "enabledPlugins": ["shub-workflow-toolkit@shub-workflow"]
  }
- Manual. Leave auto-update off (the default for third-party marketplaces). The cached copy stays pinned until you explicitly update, so nothing changes under you between sessions. To pull the latest when you want it:
  /plugin marketplace update shub-workflow              # refresh the catalog from GitHub
  /plugin update shub-workflow-toolkit@shub-workflow    # update the installed plugin
The plugin lives in plugins/shub-workflow-toolkit/; the
marketplace manifest is .claude-plugin/marketplace.json.
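If a project already has a .claude/settings.json with other keys, the plugin entry can be merged in rather than written by hand. The following is a minimal sketch, not part of shub-workflow: the helper name enable_plugin is hypothetical, and it assumes "enabledPlugins" keeps the list form shown in this README.

```python
import json
from pathlib import Path

# Plugin identifier as given in this README.
PLUGIN_ID = "shub-workflow-toolkit@shub-workflow"


def enable_plugin(settings_path: Path, plugin_id: str = PLUGIN_ID) -> dict:
    """Add plugin_id to "enabledPlugins" in settings_path, keeping other keys.

    Hypothetical helper: assumes "enabledPlugins" is a JSON list, as in the
    settings fragment shown in this README.
    """
    settings = {}
    if settings_path.exists():
        text = settings_path.read_text(encoding="utf-8")
        # Treat an empty file as empty settings instead of failing to parse.
        settings = json.loads(text) if text.strip() else {}

    enabled = settings.setdefault("enabledPlugins", [])
    if plugin_id not in enabled:  # idempotent: never add the entry twice
        enabled.append(plugin_id)

    # Create the .claude/ directory if it does not exist yet.
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return settings
```

Calling enable_plugin(Path(".claude/settings.json")) from the project root would create the file if needed and leave any existing settings untouched.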
The requirements for this library are defined in setup.py as usual. The Pipfile files in the repository don't define dependencies; they are only used for setting up a development environment for developing and testing the shub-workflow library.
To set up a development environment for shub-workflow, the package comes with Pipfile and Pipfile.lock files. Clone or fork the repository and run:
> pipenv install --dev
> cp pre-commit .git/hooks/
for installing the environment, and:
> pipenv shell
for activating it.
There is a script, lint.sh, that you can run from the repo root folder whenever you need to. It is also executed each time you do git commit (provided
you installed the pre-commit hook during the installation step described above). It checks PEP 8 compliance and typing integrity, via flake8 and mypy.
> ./lint.sh
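For environments where running a shell script is inconvenient, the same two checks can be driven from Python. This is only a rough sketch of what the text above describes (flake8 plus mypy); the actual contents and options of lint.sh are not reproduced here, and the function names lint_commands and run_lint are hypothetical.

```python
import subprocess
import sys


def lint_commands(target: str = ".") -> list[list[str]]:
    """Build the flake8 and mypy invocations for target.

    Assumption: both tools are installed in the active environment
    (e.g. via `pipenv install --dev`) and run with their default or
    project-level configuration.
    """
    return [
        [sys.executable, "-m", "flake8", target],
        [sys.executable, "-m", "mypy", target],
    ]


def run_lint(target: str = ".") -> int:
    """Run every linter; return 0 only if all of them succeed."""
    status = 0
    for cmd in lint_commands(target):
        result = subprocess.run(cmd)
        # Keep the first non-zero exit code, but still run the other linter.
        status = status or result.returncode
    return status
```

Running sys.exit(run_lint()) from the repo root, inside the pipenv shell, would give an exit status that can be used in the same way as that of a shell script.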