🕷️ AI Face Scraping & Web Scraping Toolkit

A Python toolkit for general web scraping and for collecting AI-generated faces from ThisPersonDoesNotExist.com. It includes basic and advanced scrapers with concurrent downloading, duplicate detection, and organized storage.

🚀 Features

🎭 AI Face Scraping

High-Speed Collection: Download hundreds of unique AI-generated faces
Concurrent Downloads: Multi-threaded downloading for efficiency
Duplicate Detection: MD5 hash-based duplicate prevention
Organized Storage: Automatic batch organization by date/time
Metadata Tracking: Complete download history and statistics
Respectful Scraping: Built-in rate limiting and delays
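
The MD5-based duplicate check listed above can be sketched roughly as follows. This is a minimal illustration, not the toolkit's actual code: the helper name `is_new_image` and the in-memory `seen` set are assumptions made for the example.

```python
import hashlib


def is_new_image(data: bytes, seen_hashes: set) -> bool:
    """Return True and record the hash if these bytes were not seen before.

    Hypothetical helper: the real scraper may store hashes differently.
    """
    digest = hashlib.md5(data).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True


seen = set()
first = is_new_image(b"image-bytes-1", seen)   # new content
repeat = is_new_image(b"image-bytes-1", seen)  # same bytes again
other = is_new_image(b"image-bytes-2", seen)   # different content
```

A hash comparison like this only catches byte-identical files; two visually similar images with different bytes would both be kept.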

🌐 General Web Scraping

Basic Web Scraping: Simple webpage content extraction
Advanced HTML Parsing: BeautifulSoup-powered data extraction
Multi-format Support: Handle HTML, JSON, and other content types
Error Handling: Robust error recovery and logging
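
As a rough illustration of multi-format handling, a scraper can branch on the response's Content-Type value. The sketch below uses only the standard library, and `parse_body` is a hypothetical name, not a function from this toolkit.

```python
import json
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects non-empty text chunks found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def parse_body(content_type: str, body: str):
    """Dispatch on the media type (hypothetical helper for illustration)."""
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/json":
        return json.loads(body)
    if media_type == "text/html":
        collector = TextCollector()
        collector.feed(body)
        collector.close()
        return collector.chunks
    return body  # unknown types are returned unchanged


json_result = parse_body("application/json; charset=utf-8", '{"a": 1}')
html_result = parse_body("text/html", "<h1>Hi</h1><p>there</p>")
text_result = parse_body("text/plain", "raw")
```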

📁 Project Structure

WebScrapping/
├── 🎭 Face Scrapers
│   ├── face_scraper.py              # Basic AI face downloader
│   ├── advanced_face_scraper.py     # Advanced concurrent face scraper
│   └── test_face_scraper.py         # Face scraper test utility
├── 🌐 Web Scrapers
│   ├── web_scraper.py               # Basic web scraping example
│   └── advanced_scraper.py          # Advanced HTML parsing scraper
├── 📋 Configuration
│   ├── requirements.txt             # Project dependencies
│   ├── .gitignore                   # Git ignore patterns
│   └── README.md                    # This file
└── 📊 Output (gitignored)
    ├── ai_faces/                    # Basic face downloads
    ├── ai_faces_collection/         # Advanced organized collection
    │   ├── daily_batches/           # Batch-organized images
    │   ├── metadata/                # Download metadata
    │   └── duplicates/              # Duplicate detection results
    └── *.json, *.csv               # Scraping results and metadata

🛠️ Installation & Setup

1. Clone the Repository

git clone https://github.com/yourusername/WebScrapping.git
cd WebScrapping

2. Create Virtual Environment

# Windows
python -m venv .venv
.venv\Scripts\activate

# macOS/Linux
python3 -m venv .venv
source .venv/bin/activate

3. Install Dependencies

pip install -r requirements.txt

🎯 Quick Start

Test Face Scraper

python test_face_scraper.py

Download 10 AI Faces (Basic)

python face_scraper.py

Download 50+ Faces (Advanced with Concurrency)

python advanced_face_scraper.py

Basic Web Scraping

python web_scraper.py

Advanced Web Scraping

python advanced_scraper.py

📚 Detailed Usage

🎭 Face Scraping

Basic Face Scraper

from face_scraper import FaceScraper

# Initialize scraper
scraper = FaceScraper(download_folder="my_faces")

# Download single face
scraper.download_single_face("custom_name.jpg")

# Download multiple faces
downloaded_files = scraper.download_multiple_faces(count=20, delay=2)

# Get statistics
stats = scraper.get_download_stats()
print(f"Downloaded {stats['total_files']} images ({stats['total_size_mb']} MB)")
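
One way the two statistics printed above could be computed is shown below. This is a sketch under assumptions: `folder_stats` is a hypothetical stand-in for `get_download_stats()`, and it assumes images are saved as `.jpg` files directly inside the download folder.

```python
import tempfile
from pathlib import Path


def folder_stats(folder: str) -> dict:
    """Count .jpg files in a folder and sum their sizes (hypothetical helper)."""
    files = [p for p in Path(folder).glob("*.jpg") if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    return {
        "total_files": len(files),
        "total_size_mb": round(total_bytes / (1024 * 1024), 2),
    }


# Demo on a temporary folder holding one 1 MiB image and one unrelated file
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.jpg").write_bytes(b"x" * 1024 * 1024)
    (Path(tmp) / "notes.txt").write_text("ignored")
    stats = folder_stats(tmp)
```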

Advanced Face Scraper

from advanced_face_scraper import AdvancedFaceScraper

# Initialize advanced scraper
scraper = AdvancedFaceScraper("large_collection")

# Download with concurrency
results = scraper.download_batch_concurrent(
    count=100,                    # Number of faces
    batch_name="collection_v1",   # Batch identifier
    max_workers=3,                # Concurrent downloads
    delay=1                       # Delay between batches
)

# Get collection summary
summary = scraper.get_collection_summary()
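
The role of `max_workers` can be illustrated with a small thread-pool sketch. It is not the implementation of `download_batch_concurrent`: the function `download_concurrently` and the offline `fake_fetch` stand-in are assumptions chosen so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_concurrently(fetch, count, max_workers=3):
    """Run fetch(index) for each index on a thread pool.

    `fetch` is any callable returning bytes; failures are collected, not raised.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, i): i for i in range(count)}
        for future in as_completed(futures):
            index = futures[future]
            try:
                results[index] = future.result()
            except Exception as exc:  # keep going if one download fails
                errors[index] = exc
    return results, errors


def fake_fetch(index):
    """Stand-in for a real HTTP download so the sketch runs offline."""
    if index == 2:
        raise ConnectionError("simulated failure")
    return f"image-{index}".encode()


results, errors = download_concurrently(fake_fetch, count=5, max_workers=3)
```

Keeping `max_workers` small, as in the usage example above, limits how many requests hit the server at the same time.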

🌐 Web Scraping

Basic Web Scraping

from web_scraper import scrape_basic_info

# Scrape webpage information
info = scrape_basic_info('https://example.com')
print(f"Title: {info['title']}")
print(f"Status: {info['status_code']}")

Advanced Web Scraping

from advanced_scraper import WebScraper

# Initialize scraper
scraper = WebScraper()

# Comprehensive website scraping
results = scraper.scrape_website('https://example.com')

# Save results
scraper.save_to_json(results, 'results.json')
scraper.save_links_to_csv(results['links'], 'links.csv')
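
For readers curious about what link extraction and CSV export involve, here is a standard-library sketch. The toolkit itself relies on BeautifulSoup; `LinkCollector`, `links_to_csv`, and the `url`/`text` column names are assumptions for this example only.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URL and text of every <a href=...> element."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._current = {"url": urljoin(self.base_url, href), "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data.strip()

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None


def links_to_csv(links) -> str:
    """Serialize link dictionaries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["url", "text"])
    writer.writeheader()
    writer.writerows(links)
    return buffer.getvalue()


collector = LinkCollector("https://example.com/")
collector.feed('<a href="/about">About</a> <a href="https://example.org">Org</a>')
collector.close()
csv_text = links_to_csv(collector.links)
```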

⚙️ Configuration

Face Scraper Settings

Modify these variables in the scripts:

# Number of faces to download
num_faces = 50

# Delay between downloads (seconds)
delay_seconds = 2

# Concurrent workers (advanced scraper)
max_workers = 3

# Custom folder name
download_folder = "my_custom_folder"

Web Scraper Settings

# Request timeout
timeout = 10

# Custom headers
headers = {
    'User-Agent': 'Your Custom User Agent'
}

# Rate limiting delay
delay = 1
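
The sketch below shows one way the three settings above might be applied. It uses the standard library's `urllib` instead of `requests` so it runs without extra dependencies, and the `RateLimiter` class is a hypothetical helper, not part of the toolkit.

```python
import time
import urllib.request


class RateLimiter:
    """Enforces a minimum delay between consecutive calls to wait()."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Settings mirroring the variables above
timeout = 10
headers = {"User-Agent": "Your Custom User Agent"}
delay = 1

request = urllib.request.Request("https://example.com", headers=headers)
# To actually send it: urllib.request.urlopen(request, timeout=timeout)

# Dry run with a fake clock so nothing really sleeps
ticks = iter([0.0, 0.25, 1.25])
slept = []
limiter = RateLimiter(delay, clock=lambda: next(ticks), sleep=slept.append)
limiter.wait()  # first call: nothing to wait for
limiter.wait()  # 0.25 s elapsed, so 0.75 s is still owed
```

In real use, construct `RateLimiter(delay)` with the default clock and call `wait()` before each request.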

📊 Output Examples

Face Scraper Output Structure

ai_faces_collection/
├── daily_batches/
│   ├── collection_20250628_1430/
│   │   ├── face_20250628_143022_a1b2c3d4.jpg
│   │   ├── face_20250628_143025_e5f6a7b8.jpg
│   │   └── ... (more faces)
├── metadata/
│   └── collection_20250628_1430_metadata.json
└── duplicates/
    └── dup_a1b2c3d4.jpg
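
The file names in this tree appear to combine a timestamp with a short identifier. A minimal sketch of building such a path follows; it assumes the 8-character suffix is the start of the image's MD5 hash, which this README does not confirm, and `build_face_path` is a hypothetical helper.

```python
import hashlib
from datetime import datetime
from pathlib import Path


def build_face_path(root, batch_name, data, when):
    """Build daily_batches/<batch>/face_<date>_<time>_<hash8>.jpg under root.

    Assumption: the suffix is the first 8 hex digits of the MD5 of the image.
    """
    stamp = when.strftime("%Y%m%d_%H%M%S")
    short_hash = hashlib.md5(data).hexdigest()[:8]
    file_name = f"face_{stamp}_{short_hash}.jpg"
    return Path(root) / "daily_batches" / batch_name / file_name


path = build_face_path(
    "ai_faces_collection",
    "collection_20250628_1430",
    b"example image bytes",
    datetime(2025, 6, 28, 14, 30, 22),
)
```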

Web Scraper JSON Output

{
  "url": "https://example.com",
  "timestamp": "2025-06-28 14:30:22",
  "content": {
    "title": "Example Domain",
    "headings": [...],
    "paragraphs": [...],
    "lists": [...]
  },
  "links": [...],
  "images": [...]
}

🚦 Best Practices & Ethics

🤝 Respectful Scraping

✅ Rate Limiting: Built-in delays prevent server overload
✅ Proper Headers: Realistic User-Agent strings
✅ Error Handling: Graceful failure and retry logic
✅ Legal Compliance: Respect robots.txt and terms of service
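
Respecting robots.txt can be checked programmatically with the standard library's `urllib.robotparser`. The sketch below parses an inline example file so it runs offline; the rules and the `MyScraper/1.0` user agent are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example rules; a real site's robots.txt will differ
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed = parser.can_fetch("MyScraper/1.0", "https://example.com/index.html")
blocked = parser.can_fetch("MyScraper/1.0", "https://example.com/private/data")
```

Against a live site, call `parser.set_url(...)` with the site's robots.txt URL followed by `parser.read()` instead of `parse()`.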

🔒 AI Face Usage

✅ AI-Generated: All faces are AI-created, not real people
✅ Ethical Use: Suitable for research, testing, and development
✅ No Personal Data: No real personal information collected
⚠️ Usage Responsibility: Ensure appropriate use in your projects

🛡️ Troubleshooting

Common Issues

Problem: pip is not recognized as a command

# Solution: Activate the virtual environment, or call its pip by full path
.venv\Scripts\pip.exe install requests  # Windows
.venv/bin/pip install requests  # macOS/Linux

Problem: Face scraper not downloading

# Solution: Test connectivity
python test_face_scraper.py

Problem: Permission denied errors

# Solution: Check folder permissions or run as administrator

Problem: Import errors

# Solution: Ensure virtual environment is activated
.venv\Scripts\activate  # Windows
source .venv/bin/activate  # macOS/Linux
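
If import errors persist, a quick check from Python can confirm whether the interpreter in use belongs to a virtual environment. This small sketch relies on `sys.prefix` differing from `sys.base_prefix` inside environments created with `python -m venv`; the function name `in_virtualenv` is just a label for the example.

```python
import sys


def in_virtualenv() -> bool:
    """True when running inside an environment created by `python -m venv`."""
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtualenv())
print("Interpreter:", sys.executable)
```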

📦 Dependencies

requests>=2.31.0        # HTTP requests and session management
beautifulsoup4>=4.12.0  # HTML parsing and data extraction
lxml>=4.9.0             # XML/HTML parser backend
Pillow>=10.0.0          # Image processing capabilities

🤝 Contributing

1. Fork the repository
2. Create a feature branch (git checkout -b feature/AmazingFeature)
3. Commit your changes (git commit -m 'Add some AmazingFeature')
4. Push to the branch (git push origin feature/AmazingFeature)
5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

⚠️ Disclaimer

This tool is for educational and research purposes. Users are responsible for ensuring compliance with applicable laws and website terms of service. The developers are not responsible for any misuse of this software.

🙏 Acknowledgments

ThisPersonDoesNotExist.com for providing AI-generated faces
Requests for HTTP functionality
BeautifulSoup for HTML parsing

Made with ❤️ for the scraping community
