A fully functional, educational search engine demonstrating core concepts of web crawling, inverted indexing, and relevance-based ranking—built entirely with native Python libraries (no Scrapy, BeautifulSoup, or external parsers).
Course: AI Aided Computer Engineering (İTÜ) Reference: Gold Standard Repository
- Features
- Architecture
- Quick Start
- Usage Guide
- API Reference
- Testing
- Technical Highlights
- Project Structure
- Deliverables
- ✅ Multi-threaded Design: Each crawler runs in its own thread (
threading.Thread) - ✅ Pause/Resume/Stop: Event-based control for crawler lifecycle
- ✅ State Persistence: Resume interrupted crawls from saved checkpoints
- ✅ Back-Pressure Control: Queue capacity limits prevent memory exhaustion
- ✅ Rate Limiting: Configurable requests per second (0.1 - 1000 req/s)
- ✅ SSL Fallback: Secure context first, permissive fallback for certificate issues
- ✅ Depth-Bounded: Configurable max crawl depth (1-1000)
- ✅ Letter-Based Storage: Words indexed in
a.datathroughz.datafiles - ✅ Thread-Safe Writes: Per-letter locks for concurrent crawler support
- ✅ Frequency Tracking: Word occurrence counts per URL
- ✅ Depth Metadata: Distance from seed URL stored for ranking
- ✅ Relevance Scoring:
frequency × depth_boost × match_quality - ✅ Multiple Sort Options: Relevance, Frequency, or Depth
- ✅ Pagination Support: Configurable page size and offset
- ✅ Multi-Word Queries: Searches across multiple index files
- ✅ Real-Time Monitoring: Auto-refresh status every 2 seconds
- ✅ Live Logs: Scrolling log feed with timestamps
- ✅ Crawler Control: Start, Pause, Resume, Stop buttons
- ✅ Statistics Dashboard: URLs crawled, queued, failed, indexed words
- ✅ Search Interface: Google-style search with "I'm Feeling Lucky"
┌─────────────────────────────────────────────────────────────────┐
│ Flask REST API │
│ (app.py:3600) │
├──────────────────┬──────────────────┬───────────────────────────┤
│ /crawler/* │ /search │ Static Files │
│ endpoints │ endpoint │ (/demo/*) │
└────────┬─────────┴────────┬─────────┴───────────────────────────┘
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ CrawlerService │ │ SearchService │
│ - create() │ │ - search() │
│ - pause() │ │ - rank() │
│ - resume() │ │ - paginate() │
│ - stop() │ │ │
└────────┬────────┘ └────────┬────────┘
│ │
▼ │
┌─────────────────┐ │
│ CrawlerJob │ │
│ (Thread Pool) │ │
│ - _pause_event │ │
│ - _stop_event │ │
│ - url_queue │ │
└────────┬────────┘ │
│ │
▼ ▼
┌─────────────────────────────────────────┐
│ File-Based Persistence │
│ ┌─────────────┐ ┌─────────────────┐ │
│ │ storage/*.data│ │ crawlers/*.data │ │
│ │ (word index) │ │ (crawler state)│ │
│ └─────────────┘ └─────────────────┘ │
└─────────────────────────────────────────┘
| Resource | Protection | Pattern |
|---|---|---|
| URL Queue | queue.Queue(maxsize=N) |
Built-in thread-safe FIFO |
| Pause Signal | threading.Event() |
.wait() blocks, .set() releases |
| Stop Signal | threading.Event() |
Non-blocking check with .is_set() |
| Visited URLs | threading.Lock |
Mutual exclusion for set operations |
| Index Files | Per-letter threading.Lock |
Fine-grained locking (a.data, b.data separate) |
| File I/O | Atomic write-temp-rename | os.replace() prevents partial writes |
# 1. Queue capacity limit enforces memory bounds
url_queue = queue.Queue(maxsize=10000) # Default
# 2. Timeout-based put (non-blocking)
url_queue.put((url, depth), timeout=0.1)
# → queue.Full: Skip URL (back-pressure active)
# 3. Session URL limit
if urls_crawled >= max_urls_to_visit:
break # Stop crawling
# 4. Rate limiting
sleep(1.0 / hit_rate) # Enforce requests/second- Python 3.8 or higher
- pip (Python package manager)
# Clone repository
git clone https://github.com/yourusername/google-in-a-day.git
cd google-in-a-day
# Install dependencies
pip install -r requirements.txt# Start Flask API on port 3600
python app.pyhttp://localhost:3600/
Via Dashboard:
- Navigate to
http://localhost:3600/crawler - Enter seed URL (e.g.,
https://example.com) - Configure parameters:
- Max Depth: How deep to crawl (1-1000)
- Hit Rate: Requests per second (0.1-1000)
- Max URLs: Session limit (1-10,000)
- Queue Capacity: Memory limit (100-100,000)
- Click Start Crawler
Via API:
curl -X POST http://localhost:3600/crawler/create \
-H "Content-Type: application/json" \
-d '{
"origin": "https://example.com",
"max_depth": 3,
"hit_rate": 100,
"max_urls_to_visit": 1000
}'Response:
{
"crawler_id": "1679404800_12345",
"status": "Active"
}Via Dashboard:
- Navigate to
http://localhost:3600/status - Select crawler from dropdown
- View real-time statistics, logs, and progress
Via API:
curl http://localhost:3600/crawler/status/1679404800_12345Pause:
curl -X POST http://localhost:3600/crawler/pause/1679404800_12345Resume:
curl -X POST http://localhost:3600/crawler/resume/1679404800_12345Stop:
curl -X POST http://localhost:3600/crawler/stop/1679404800_12345Resume from Files (after stop):
curl -X POST http://localhost:3600/crawler/resume-from-files/1679404800_12345Via Dashboard:
- Navigate to
http://localhost:3600/search-page - Enter query (e.g.,
python programming) - Select sort option (Relevance, Frequency, Depth)
- Click Search or I'm Feeling Lucky
Via API:
curl "http://localhost:3600/search?query=python&pageLimit=10&pageOffset=0&sortBy=relevance"Response:
{
"query": "python",
"query_words": ["python"],
"results": [
{
"word": "python",
"url": "https://docs.python.org/tutorial",
"origin": "https://python.org",
"depth": 1,
"frequency": 23,
"score": 8.45
}
],
"total_results": 156,
"page_limit": 10,
"page_offset": 0
}| Method | Endpoint | Description |
|---|---|---|
| POST | /crawler/create |
Start new crawler |
| GET | /crawler/status/<id> |
Get crawler status |
| GET | /crawler/list |
List all crawlers |
| POST | /crawler/pause/<id> |
Pause active crawler |
| POST | /crawler/resume/<id> |
Resume paused crawler |
| POST | /crawler/stop/<id> |
Stop crawler |
| POST | /crawler/resume-from-files/<id> |
Resume from saved state |
| POST | /crawler/clear |
Clear all data |
| GET | /crawler/stats |
Get aggregate statistics |
| Method | Endpoint | Description |
|---|---|---|
| GET | /search |
Search indexed content |
| GET | /search/random |
Get random word (I'm Feeling Lucky) |
| GET | /index/stats |
Get index statistics |
| Parameter | Type | Default | Description |
|---|---|---|---|
query |
string | Required | Search query |
pageLimit |
int | 10 | Results per page (1-100) |
pageOffset |
int | 0 | Results to skip |
sortBy |
string | relevance | Sort: relevance, frequency, depth |
# HTML Parser tests (21 tests)
python -m unittest utils.__test__.test_html_parser -v
# Crawler Job tests (20+ tests)
python -m unittest utils.__test__.test_crawler_job -v
# Run all tests
python -m unittest discover -s utils/__test__ -v| Module | Tests | Coverage |
|---|---|---|
html_parser.py |
21 | 90%+ |
crawler_job.py |
20+ | 85%+ |
- Start crawler with seed URL
- Verify URLs being crawled in logs
- Pause crawler, verify state freezes
- Resume crawler, verify continuation
- Stop crawler, verify graceful shutdown
- Resume from files, verify queue restored
- Search for indexed words
- Verify pagination works
- Test different sort options
- Clear all data, verify cleanup
class CrawlerJob(threading.Thread):
def __init__(self):
self._pause_event = threading.Event() # .set() = running
self._stop_event = threading.Event() # .set() = stop
def run(self):
while not self._stop_event.is_set():
self._pause_event.wait() # BLOCKS if paused
# ... crawl logic ...Why This Works:
_pause_event.wait(): Blocks thread until event is set (running state)pause(): Clears event → thread blocks on next iterationresume(): Sets event → thread unblocks and continues- No busy-waiting, no polling, no CPU waste
# Class-level lock dictionary
_index_locks: Dict[str, threading.Lock] = {}
@classmethod
def _get_index_lock(cls, letter: str) -> threading.Lock:
with cls._index_locks_lock:
if letter not in cls._index_locks:
cls._index_locks[letter] = threading.Lock()
return cls._index_locks[letter]
def _index_words(self, url: str, word_counts: Counter, depth: int):
for word, freq in word_counts.items():
letter = word[0].lower()
lock = self._get_index_lock(letter)
with lock: # Only this letter file locked
with open(f"{letter}.data", 'a') as f:
f.write(f"{word} {url} {self.origin} {depth} {freq}\n")Benefits:
- Multiple crawlers can write to different letters simultaneously
- Prevents file corruption from concurrent writes
- Better parallelism than single global lock
def _atomic_write(self, filepath: str, content: str):
temp_path = filepath + ".tmp"
with open(temp_path, 'w') as f:
f.write(content)
os.replace(temp_path, filepath) # ATOMIC on most systemsPrevents:
- Partial writes on crash
- Reader seeing half-written data
- File corruption
project-google/
├── app.py # Flask REST API (port 3600)
├── product_prd.md # Product Requirements Document
├── recommendation.md # Production roadmap
├── README.md # This file
├── requirements.txt # Dependencies (flask, flask-cors)
│
├── services/
│ ├── __init__.py
│ ├── crawler_service.py # Crawler lifecycle management
│ └── search_service.py # Search and ranking logic
│
├── utils/
│ ├── __init__.py
│ ├── crawler_job.py # CrawlerJob(threading.Thread)
│ └── __test__/
│ ├── __init__.py
│ ├── test_html_parser.py # HTML parser tests
│ └── test_crawler_job.py # Crawler tests
│
├── demo/
│ ├── crawler.html # Crawler control dashboard
│ ├── status.html # Real-time monitoring
│ └── search.html # Search interface
│
└── data/ # Auto-created at runtime
├── storage/ # Index files (a.data - z.data)
├── crawlers/ # Crawler state ({id}.data, .logs, .queue)
└── visited_urls.data # Global visited URL set
| Deliverable | Status | Path |
|---|---|---|
| Codebase | ✅ Complete | Entire repository |
| PRD | ✅ Complete | product_prd.md |
| README | ✅ Complete | README.md |
| Recommendation | ✅ Complete | recommendation.md |
This project demonstrates:
- Concurrency: Threading, locks, events, queues
- Web Technologies: HTTP, HTML parsing, URL normalization
- Data Structures: Inverted index, FIFO queue, hash sets
- File I/O: Atomic writes, checkpointing, crash recovery
- API Design: RESTful endpoints, JSON responses
- Frontend: HTML/CSS/JS without frameworks
AI Stewardship Questions to Prepare:
-
Why threading instead of multiprocessing?
- CPython GIL limits true parallelism, but threading is sufficient for I/O-bound crawling
queue.Queueis thread-safe but not process-safe- Lower memory overhead (shared memory vs. IPC)
-
How is back-pressure implemented?
queue.Queue(maxsize=N)blocks/times out when fullput(timeout=0.1)skips URLs if queue at capacitymax_urls_to_visithard limit prevents infinite crawling
-
What prevents race conditions?
threading.Lockforvisited_urlssetthreading.Eventfor pause/stop signals (atomic)- Per-letter locks for index file writes
- Atomic file operations (
os.replace())
- Port 3600 chosen to match Gold Standard repository
- Only native Python libraries used (no Scrapy, BeautifulSoup, requests)
- Flask is the only external dependency (allowed per course rules)
- Thread-safety prioritized throughout (locks, events, atomic operations)
MIT License - Educational project for ITU AI Aided Computer Engineering course.
- Gold Standard Repository: ereneld/crawler
- Course Instructor: ITU AI Aided Computer Engineering
- Reference Materials: Product PRD, course slides, technical specifications
Built with ❤️ using only Python standard library + Flask