scrape2postgresql is a demo project that shows how to use Scrapy to scrape web pages and store structured data in a PostgreSQL database using Docker.
The goals of this project are:
- to understand how Scrapy works in real projects
- to learn how to persist scraped data properly
- to follow industry-style containerized workflows
- to avoid common beginner mistakes in scraping setups
This project uses books.toscrape.com as a safe, public demo website.
- Scrapes book title and price
- Accepts a dynamic URL at runtime
- Handles pagination correctly
- Stores data in PostgreSQL
- Runs Scrapy as a one-shot job
- Keeps the database persistent
- Uses Docker Compose for clean service separation
- Uses a Makefile for simple developer commands
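Because the start URL is supplied at runtime, it helps to reject malformed input before any request is made. The following is a minimal sketch, not the project's actual code: `validate_start_url` is a hypothetical helper, and how the Makefile forwards `url=...` to the spider is an assumption. In Scrapy, arguments passed with `-a name=value` arrive as keyword arguments to the spider's `__init__`, which is where a check like this could be called.

```python
from urllib.parse import urlparse


def validate_start_url(url: str) -> str:
    """Return the trimmed URL if it looks crawlable, otherwise raise ValueError.

    Hypothetical helper: only checks the scheme and host, nothing site-specific.
    """
    candidate = url.strip()
    parsed = urlparse(candidate)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"unsupported or missing scheme in {url!r}")
    if not parsed.netloc:
        raise ValueError(f"missing host in {url!r}")
    return candidate
```

Failing early here keeps a typo in the URL from turning into an empty crawl that exits without writing any rows.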
Most Scrapy tutorials:
- scrape data into CSV/JSON
- run everything locally
- don’t show how real systems are structured
In real-world scraping:
- scrapers are jobs, not long-running services
- databases live independently
- containers are disposable
- configuration must be repeatable
scrape2postgresql demonstrates exactly that.
Scrapy (one shot) ───▶ PostgreSQL (persistent)
- Scrapy container:
  - starts
  - crawls
  - inserts data
  - exits cleanly
- PostgreSQL container:
  - stays running
  - persists data on the host machine using a Docker volume
- PostgreSQL runs as a service container
- Scrapy runs as a temporary container
- A URL is passed at runtime
- Scrapy:
  - requests the page
  - follows pagination
  - extracts structured data
  - writes rows to PostgreSQL
- Scrapy exits
- Database remains available for querying
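The "follows pagination" step in the workflow above comes down to resolving each page's "next" link against the URL of the page it was found on, and stopping when there is no such link. Here is a small standard-library sketch of that idea. `next_page_url` is a hypothetical name, and the relative href `page-2.html` is an assumed example of what a "next" link might contain.

```python
from typing import Optional
from urllib.parse import urljoin


def next_page_url(current_url: str, next_href: Optional[str]) -> Optional[str]:
    """Resolve a (possibly relative) 'next' link against the current page URL.

    Returns None when there is no next link, which is the signal to stop.
    """
    if not next_href:
        return None
    return urljoin(current_url, next_href)


start = "https://books.toscrape.com/catalogue/category/books/business_35/index.html"
print(next_page_url(start, "page-2.html"))
# https://books.toscrape.com/catalogue/category/books/business_35/page-2.html
print(next_page_url(start, None))
# None
```

Inside a Scrapy callback the same resolution is available as `response.urljoin(href)`, and `response.follow(href, callback=self.parse)` resolves the link and schedules the request in one call.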
- Docker Desktop
- Docker Compose (bundled with Docker Desktop)
- Make (usually already available on macOS/Linux)
Verify:
docker compose version
make --version
Usage:
🏁 Quick Start (Recommended)
- Start PostgreSQL
make db
- Run the scraper with a URL
make scrape url="https://books.toscrape.com/catalogue/category/books/business_35/index.html"
The scraper runs once, inserts the data into the database, and the container exits automatically.
- Query the Database
make psql
Then run:
SELECT title, price FROM books;
| Command | Description |
|---|---|
| `make db` | Start PostgreSQL |
| `make scrape url="..."` | Run scraper once |
| `make psql` | Open PostgreSQL shell |
| `make logs` | View Postgres logs |
| `make down` | Stop all containers |
| `make clean` | Stop containers + delete DB |
| `make build` | Rebuild scraper image |
| `make rebuild` | Full rebuild |
| `make reset-db` | Clear table + reset IDs |
-- View all entries
SELECT * FROM books;
-- Only title + price
SELECT title, price FROM books;
-- Count rows
SELECT COUNT(*) FROM books;
-- Reset table
TRUNCATE TABLE books RESTART IDENTITY;
- Stop running containers (db) without deleting existing data
make down
- Remove all containers and persistent data
make clean
- Rebuild from docker-compose
make rebuild
You can easily adapt this project to scrape other websites:
- Change the spider logic
Edit:
bookscraper/bookscraper/spiders/books.py
- Update selectors
- Extract more fields
- Follow detail pages
- Change the database schema
Edit:
pipelines.py
- Add new columns
- Normalize data
- Add constraints
- Add more spiders
Create a new file under:
bookscraper/bookscraper/spiders/
Each spider can use the same database pipeline
- Use a different website
As long as the site structure is consistent:
- pass a new URL
- update selectors
- reuse everything else
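When adapting the schema or pipeline, two things usually change together: how a scraped string is normalized and how it is written to the table. The sketch below illustrates both under stated assumptions. It is not the project's `pipelines.py`. `parse_price` and `save_book` are hypothetical helpers, the `books` table is assumed to have `title` and numeric `price` columns, and `conn` is assumed to be a DB-API connection such as one returned by `psycopg2.connect(...)`.

```python
from decimal import Decimal, InvalidOperation

# Assumed table layout: books(title, price). %s is the placeholder style
# used by psycopg2; values are passed separately, never formatted into SQL.
INSERT_SQL = "INSERT INTO books (title, price) VALUES (%s, %s)"


def parse_price(raw: str) -> Decimal:
    """Turn a scraped price string such as '£51.77' into a Decimal."""
    cleaned = "".join(ch for ch in raw if ch in "0123456789.")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        raise ValueError(f"no numeric price in {raw!r}") from None


def save_book(conn, title: str, raw_price: str) -> None:
    """Insert one normalized row using a parameterized query, then commit."""
    cur = conn.cursor()
    try:
        cur.execute(INSERT_SQL, (title.strip(), parse_price(raw_price)))
        conn.commit()
    finally:
        cur.close()
```

Storing the price as a `Decimal` rather than a string keeps queries such as sorting or averaging by price possible later, and the parameterized insert leaves quoting to the driver.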
- Consistent environments
- No local dependency issues
- Easy teardown and rebuild
- Clear separation of concerns
- Matches production workflows
This project is for learning and demonstration purposes only. Always:
- check a website’s terms of service
- respect robots.txt
- scrape responsibly
- Integrate Zyte API
- Add .env file for configuration
- Track scrape timestamps
- Store price history
- Add FastAPI to expose data
- Schedule runs via cron or CI