🕷️ scrape2postgresql

scrape2postgresql is a demo project that shows how to use Scrapy to scrape web pages and store structured data in a PostgreSQL database using Docker.

The goals of this project are:

to understand how Scrapy works in real projects
to learn how to persist scraped data properly
to follow industry-style containerized workflows
to avoid common beginner mistakes in scraping setups

This project uses books.toscrape.com as a safe, public demo website.

🚀 What this project does

Scrapes book title and price
Accepts a dynamic URL at runtime
Handles pagination correctly
Stores data in PostgreSQL
Runs Scrapy as a one-shot job
Keeps the database persistent
Uses Docker Compose for clean service separation
Uses Makefile for simple developer commands

🧠 Why this project exists

Most Scrapy tutorials:

scrape data into CSV/JSON
run everything locally
don’t show how real systems are structured

In real-world scraping:

scrapers are jobs, not long-running services
databases live independently
containers are disposable
configuration must be repeatable

scrape2postgresql demonstrates exactly that.

🏗 Architecture (High level)

│ Scrapy (one shot) │ ───▶ │ PostgreSQL (persistent)│

Scrapy container:
- starts
- crawls
- inserts data
- exits cleanly
PostgreSQL container:
- stays running
- persists data on the host machine using a Docker volume

⚙️ How it works (step-by-step)

PostgreSQL runs as a service container
Scrapy runs as a temporary container
A URL is passed at runtime
Scrapy:
- requests the page
- follows pagination
- extracts structured data
- writes rows to PostgreSQL
Scrapy exits
Database remains available for querying
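
The crawl steps above can be sketched in plain Python. This is only an illustration of the control flow, not the project's spider: in Scrapy the same loop is expressed through callbacks and the scheduler, and relative links are typically resolved with response.urljoin or response.follow. The crawl function and the fetch_page callable are hypothetical names, and the fake site stands in for real HTTP requests.

```python
from urllib.parse import urljoin


def crawl(start_url, fetch_page):
    """Collect rows from start_url and every page reachable via "next" links.

    fetch_page(url) is a hypothetical callable that returns (rows, next_href),
    where next_href is the (possibly relative) link to the next page, or None.
    """
    rows = []
    seen = set()
    url = start_url
    while url is not None and url not in seen:
        seen.add(url)  # guard against pagination loops
        page_rows, next_href = fetch_page(url)
        rows.extend(page_rows)
        # Relative links such as "page-2.html" are resolved against the current URL.
        url = urljoin(url, next_href) if next_href else None
    return rows


# A fake two-page site standing in for real HTTP requests (assumed data).
FAKE_SITE = {
    "https://example.com/catalogue/index.html": (
        [{"title": "Book A", "price": "10.00"}],
        "page-2.html",
    ),
    "https://example.com/catalogue/page-2.html": (
        [{"title": "Book B", "price": "12.50"}],
        None,
    ),
}

books = crawl("https://example.com/catalogue/index.html", FAKE_SITE.__getitem__)
print([book["title"] for book in books])
```

The loop ends when a page has no next link, which mirrors the one-shot behaviour: once the last page is processed, the job has nothing left to do and exits.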

🧰 Prerequisites

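Docker Desktop
Docker Compose v2 (the docker compose command, bundled with Docker Desktop)
Make (usually preinstalled on Linux; on macOS it ships with the Xcode Command Line Tools)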

Verify:

docker compose version
make --version

Usage:

🏁 Quick Start (Recommended)

Start PostgreSQL

make db

Run the scraper with a URL

make scrape url="https://books.toscrape.com/catalogue/category/books/business_35/index.html"

Scraper runs once
Data is inserted into the database
Container exits automatically

Query the Database

make psql

then

SELECT title, price FROM books;

Makefile commands

Command	Description
`make db`	Start PostgreSQL
`make scrape url="..."`	Run scraper once
`make psql`	Open PostgreSQL shell
`make logs`	View Postgres logs
`make down`	Stop all containers
`make clean`	Stop containers + delete DB
`make build`	Rebuild scraper image
`make rebuild`	Full rebuild
`make reset-db`	Clear table + reset IDs

--

Useful SQL commands to query the database

-- View all entries
SELECT * FROM books;

-- Only title + price
SELECT title, price FROM books;

-- Count rows
SELECT COUNT(*) FROM books;

-- Reset table
TRUNCATE TABLE books RESTART IDENTITY;

--

Teardown

Stop running containers (db) without deleting existing data

make down

Remove all containers and persistent data

make clean

Rebuild from docker-compose.yml

make rebuild

🔧 How to adapt this project for your own scraping needs

You can easily adapt this project to scrape other websites:

Change the spider logic

Edit:

bookscraper/bookscraper/spiders/books.py

Update selectors
Extract more fields
Follow detail pages

Change the database schema

Edit:

pipelines.py

Add new columns
Normalize data
Add constraints
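
As one example of normalizing data before it reaches the database, the sketch below converts a scraped price string into a Decimal. It assumes prices arrive as text with a currency symbol, such as "£51.77"; the parse_price helper is hypothetical and is not part of this project.

```python
import re
from decimal import Decimal

# Matches the first run of digits with an optional fractional part.
_PRICE_RE = re.compile(r"\d+(?:\.\d+)?")


def parse_price(raw):
    """Return the numeric part of a scraped price string as a Decimal.

    Hypothetical helper: strips currency symbols and thousands separators,
    and raises ValueError when no number is present.
    """
    cleaned = raw.replace(",", "")
    match = _PRICE_RE.search(cleaned)
    if match is None:
        raise ValueError(f"no numeric price found in {raw!r}")
    return Decimal(match.group())


print(parse_price("£51.77"))
```

Decimal is used instead of float so that monetary values are not subject to binary rounding, and it pairs naturally with a NUMERIC column if the schema is changed to store prices as numbers.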

Add more spiders

Create a new file under:

bookscraper/bookscraper/spiders/

Each spider can use the same database pipeline
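
Sharing works because a Scrapy item pipeline is an ordinary class with open_spider, process_item, and close_spider methods, enabled once in the ITEM_PIPELINES setting. The following is a minimal sketch, not the project's actual pipelines.py: the class name, the psycopg2 driver, and the environment variable names and defaults are all assumptions. The books table with title and price columns is taken from the SQL queries in this README.

```python
import os


class PostgresPipeline:
    """Hypothetical pipeline that inserts each scraped item into the books table."""

    INSERT_SQL = "INSERT INTO books (title, price) VALUES (%s, %s)"

    def __init__(self, connect=None):
        # A connection factory can be injected, which keeps the class testable
        # without a running database.
        self._connect = connect or self._default_connect
        self.conn = None

    @staticmethod
    def _default_connect():
        import psycopg2  # assumed driver; imported lazily

        # Variable names and default values are assumptions for illustration.
        return psycopg2.connect(
            host=os.environ.get("POSTGRES_HOST", "db"),
            dbname=os.environ.get("POSTGRES_DB", "books"),
            user=os.environ.get("POSTGRES_USER", "postgres"),
            password=os.environ.get("POSTGRES_PASSWORD", ""),
        )

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.conn = self._connect()

    def process_item(self, item, spider):
        # Called for every item; parameters are passed separately from the SQL
        # text so the driver handles quoting.
        with self.conn.cursor() as cur:
            cur.execute(self.INSERT_SQL, (item["title"], item["price"]))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        if self.conn is not None:
            self.conn.close()
```

Because the pipeline only relies on the title and price keys of each item, any spider that yields items with those keys can reuse it unchanged.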

Use a different website

As long as the site structure is consistent:

pass a new URL
update selectors
reuse everything else

🐳 Why Docker Compose?

Consistent environments
No local dependency issues
Easy teardown and rebuild
Clear separation of concerns
Matches production workflows

⚠️ Disclaimer

This project is for learning and demonstration purposes only. Always:

check a website’s terms of service
respect robots.txt
scrape responsibly

🌱 Next improvements (ideas)

Integrate the Zyte API
Add .env file for configuration
Track scrape timestamps
Store price history
Add FastAPI to expose data
Schedule runs via cron or CI

