A comprehensive Python toolkit for web scraping and for collecting AI-generated faces from ThisPersonDoesNotExist.com. It includes both basic and advanced scrapers with concurrent downloading, duplicate detection, and organized storage.
- High-Speed Collection: Download hundreds of unique AI-generated faces
- Concurrent Downloads: Multi-threaded downloading for efficiency
- Duplicate Detection: MD5 hash-based duplicate prevention
- Organized Storage: Automatic batch organization by date/time
- Metadata Tracking: Complete download history and statistics
- Respectful Scraping: Built-in rate limiting and delays
- Basic Web Scraping: Simple webpage content extraction
- Advanced HTML Parsing: BeautifulSoup-powered data extraction
- Multi-format Support: Handle HTML, JSON, and other content types
- Error Handling: Robust error recovery and logging
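The MD5 hash-based duplicate prevention listed above can be illustrated with a minimal sketch. It assumes duplicates are detected by hashing the raw image bytes and remembering the digests already seen. The `DuplicateTracker` class and its method names are hypothetical and are not taken from the project's scrapers.

```python
import hashlib


class DuplicateTracker:
    """Remember MD5 digests of content seen so far (hypothetical helper)."""

    def __init__(self):
        self._seen = set()

    def is_new(self, content: bytes) -> bool:
        """Return True the first time `content` is seen, False afterwards."""
        digest = hashlib.md5(content).hexdigest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


tracker = DuplicateTracker()
# Stand-ins for downloaded image bytes; the third repeats the first.
samples = [b"image-one", b"image-two", b"image-one"]
unique = [data for data in samples if tracker.is_new(data)]
print(f"{len(unique)} unique of {len(samples)} downloads")
```

MD5 is used here only as a fast content fingerprint, not for security. A real scraper would probably also persist the digests between runs.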
WebScrapping/
├── Face Scrapers
│   ├── face_scraper.py             # Basic AI face downloader
│   ├── advanced_face_scraper.py    # Advanced concurrent face scraper
│   └── test_face_scraper.py        # Face scraper test utility
├── Web Scrapers
│   ├── web_scraper.py              # Basic web scraping example
│   └── advanced_scraper.py         # Advanced HTML parsing scraper
├── Configuration
│   ├── requirements.txt            # Project dependencies
│   ├── .gitignore                  # Git ignore patterns
│   └── README.md                   # This file
└── Output (gitignored)
    ├── ai_faces/                   # Basic face downloads
    ├── ai_faces_collection/        # Advanced organized collection
    │   ├── daily_batches/          # Batch-organized images
    │   ├── metadata/               # Download metadata
    │   └── duplicates/             # Duplicate detection results
    └── *.json, *.csv               # Scraping results and metadata
git clone https://github.com/yourusername/WebScrapping.git
cd WebScrapping

# Windows
python -m venv .venv
.venv\Scripts\activate

# macOS/Linux
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Face scraper test utility
python test_face_scraper.py

# Basic AI face downloader
python face_scraper.py

# Advanced concurrent face scraper
python advanced_face_scraper.py

# Basic web scraping example
python web_scraper.py

# Advanced HTML parsing scraper
python advanced_scraper.py

from face_scraper import FaceScraper
# Initialize scraper
scraper = FaceScraper(download_folder="my_faces")
# Download single face
scraper.download_single_face("custom_name.jpg")
# Download multiple faces
downloaded_files = scraper.download_multiple_faces(count=20, delay=2)
# Get statistics
stats = scraper.get_download_stats()
print(f"Downloaded {stats['total_files']} images ({stats['total_size_mb']} MB)")

from advanced_face_scraper import AdvancedFaceScraper
# Initialize advanced scraper
scraper = AdvancedFaceScraper("large_collection")
# Download with concurrency
results = scraper.download_batch_concurrent(
    count=100,                   # Number of faces
    batch_name="collection_v1",  # Batch identifier
    max_workers=3,               # Concurrent downloads
    delay=1                      # Delay between batches
)
# Get collection summary
summary = scraper.get_collection_summary()

from web_scraper import scrape_basic_info
# Scrape webpage information
info = scrape_basic_info('https://example.com')
print(f"Title: {info['title']}")
print(f"Status: {info['status_code']}")

from advanced_scraper import WebScraper
# Initialize scraper
scraper = WebScraper()
# Comprehensive website scraping
results = scraper.scrape_website('https://example.com')
# Save results
scraper.save_to_json(results, 'results.json')
scraper.save_links_to_csv(results['links'], 'links.csv')

Modify these variables in the scripts:
# Number of faces to download
num_faces = 50
# Delay between downloads (seconds)
delay_seconds = 2
# Concurrent workers (advanced scraper)
max_workers = 3
# Custom folder name
download_folder = "my_custom_folder"

# Request timeout
timeout = 10
# Custom headers
headers = {
    'User-Agent': 'Your Custom User Agent'
}
# Rate limiting delay
delay = 1

ai_faces_collection/
├── daily_batches/
│   └── collection_20250628_1430/
│       ├── face_20250628_143022_a1b2c3d4.jpg
│       ├── face_20250628_143025_e5f6g7h8.jpg
│       └── ... (more faces)
├── metadata/
│   └── collection_20250628_1430_metadata.json
└── duplicates/
    └── dup_a1b2c3d4.jpg
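The layout above suggests a naming scheme built from a timestamp and a short hash. The following sketch shows one way such paths could be generated. It assumes that the 8-character suffix is the start of the image's MD5 digest and that batch folders are named with the batch name plus a `YYYYMMDD_HHMM` stamp. The `batch_dir` and `face_filename` helpers are hypothetical and are not part of the project.

```python
import hashlib
from datetime import datetime
from pathlib import Path


def batch_dir(root: Path, batch_name: str, when: datetime) -> Path:
    """Build the folder path for one batch, stamped to the minute."""
    return root / "daily_batches" / f"{batch_name}_{when:%Y%m%d_%H%M}"


def face_filename(content: bytes, when: datetime) -> str:
    """Build a file name from a timestamp and the first 8 hex digits of the MD5."""
    short_hash = hashlib.md5(content).hexdigest()[:8]
    return f"face_{when:%Y%m%d_%H%M%S}_{short_hash}.jpg"


when = datetime(2025, 6, 28, 14, 30, 22)
image_bytes = b"example image bytes"  # stand-in for a downloaded image
folder = batch_dir(Path("ai_faces_collection"), "collection", when)
target = folder / face_filename(image_bytes, when)
print(target.as_posix())
```

Deriving the suffix from the content hash makes file names unique per image, so the same image saved twice in the same second would map to the same name.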
{
  "url": "https://example.com",
  "timestamp": "2025-06-28 14:30:22",
  "content": {
    "title": "Example Domain",
    "headings": [...],
    "paragraphs": [...],
    "lists": [...]
  },
  "links": [...],
  "images": [...]
}

- Rate Limiting: Built-in delays prevent server overload
- Proper Headers: Realistic User-Agent strings
- Error Handling: Graceful failure and retry logic
- Legal Compliance: Respect robots.txt and terms of service

- AI-Generated: All faces are AI-created, not real people
- Ethical Use: Suitable for research, testing, and development
- No Personal Data: No real personal information collected
- ⚠️ Usage Responsibility: Ensure appropriate use in your projects
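The robots.txt and rate-limiting practices listed above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's implementation: the robots.txt rules are an invented sample, the User-Agent string is a placeholder, and `allowed_urls` and `visit_politely` are hypothetical helpers.

```python
import time
from urllib.robotparser import RobotFileParser

# Invented robots.txt content. A real scraper would load the site's own file,
# for example with RobotFileParser.set_url(...) followed by .read().
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

USER_AGENT = "ExampleScraper/0.1"  # placeholder User-Agent string


def allowed_urls(urls, robots_text, user_agent=USER_AGENT):
    """Return only the URLs that the given robots.txt text permits."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def visit_politely(urls, delay=1.0, fetch=print):
    """Call `fetch` on each URL, sleeping `delay` seconds between calls."""
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay)  # rate limiting between consecutive requests
        fetch(url)


candidates = [
    "https://example.com/index.html",
    "https://example.com/private/data.html",
]
permitted = allowed_urls(candidates, SAMPLE_ROBOTS)
visit_politely(permitted, delay=0.1)
```

Passing the fetch function as a parameter keeps the sketch runnable offline. In practice it would be replaced by a function that performs the actual HTTP request.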
Problem: pip install requests not recognized
# Solution: Use full path or activate virtual environment
.venv\Scripts\pip.exe install requests

Problem: Face scraper not downloading
# Solution: Test connectivity
python test_face_scraper.py

Problem: Permission denied errors
# Solution: Check folder permissions or run as administrator

Problem: Import errors
# Solution: Ensure virtual environment is activated
.venv\Scripts\activate     # Windows
source .venv/bin/activate  # macOS/Linux

requests>=2.31.0        # HTTP requests and session management
beautifulsoup4>=4.12.0  # HTML parsing and data extraction
lxml>=4.9.0             # XML/HTML parser backend
Pillow>=10.0.0          # Image processing capabilities
- Fork the repository
- Create a feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
This tool is for educational and research purposes. Users are responsible for ensuring compliance with applicable laws and website terms of service. The developers are not responsible for any misuse of this software.
- ThisPersonDoesNotExist.com for providing AI-generated faces
- Requests for HTTP functionality
- BeautifulSoup for HTML parsing