SIAFI-UNAM/LoominisingProject


πŸ•·οΈ AI Face Scraping & Web Scraping Toolkit

A Python toolkit for general web scraping and for collecting AI-generated faces from ThisPersonDoesNotExist.com. It includes basic and advanced scrapers with concurrent downloading, duplicate detection, and organized storage.

🚀 Features

🎭 AI Face Scraping

  • High-Speed Collection: Download hundreds of unique AI-generated faces
  • Concurrent Downloads: Multi-threaded downloading for efficiency
  • Duplicate Detection: MD5 hash-based duplicate prevention
  • Organized Storage: Automatic batch organization by date/time
  • Metadata Tracking: Complete download history and statistics
  • Respectful Scraping: Built-in rate limiting and delays
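
The duplicate detection listed above can be sketched as follows. This is a minimal illustration of MD5-based filtering, not the repository's actual code; the `DuplicateFilter` class and its `is_new` method are hypothetical names introduced here.

```python
import hashlib


class DuplicateFilter:
    """Tracks MD5 digests of payloads seen so far (hypothetical helper)."""

    def __init__(self):
        self._seen = set()

    def is_new(self, data: bytes) -> bool:
        """Return True the first time a payload is seen, False afterwards."""
        digest = hashlib.md5(data).hexdigest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True
```

A scraper could call `is_new(image_bytes)` before writing a file and skip, or move aside, any payload for which it returns `False`.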

🌐 General Web Scraping

  • Basic Web Scraping: Simple webpage content extraction
  • Advanced HTML Parsing: BeautifulSoup-powered data extraction
  • Multi-format Support: Handle HTML, JSON, and other content types
  • Error Handling: Robust error recovery and logging
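
One common way to get the error recovery and logging mentioned above is a retry wrapper with exponential backoff. The sketch below assumes that approach; `with_retries` and its parameters are hypothetical and may not match what the scripts actually do.

```python
import logging
import time

logger = logging.getLogger("scraper_sketch")


def with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    Waits base_delay, then 2 * base_delay, and so on between tries.
    The last failure is re-raised so the caller can handle it.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as exc:
            logger.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.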

πŸ“ Project Structure

WebScrapping/
├── 🎭 Face Scrapers
│   ├── face_scraper.py              # Basic AI face downloader
│   ├── advanced_face_scraper.py     # Advanced concurrent face scraper
│   └── test_face_scraper.py         # Face scraper test utility
├── 🌐 Web Scrapers
│   ├── web_scraper.py               # Basic web scraping example
│   └── advanced_scraper.py          # Advanced HTML parsing scraper
├── 📋 Configuration
│   ├── requirements.txt             # Project dependencies
│   ├── .gitignore                   # Git ignore patterns
│   └── README.md                    # This file
└── 📊 Output (gitignored)
    ├── ai_faces/                    # Basic face downloads
    ├── ai_faces_collection/         # Advanced organized collection
    │   ├── daily_batches/           # Batch-organized images
    │   ├── metadata/                # Download metadata
    │   └── duplicates/              # Duplicate detection results
    └── *.json, *.csv                # Scraping results and metadata

πŸ› οΈ Installation & Setup

1. Clone the Repository

git clone https://github.com/yourusername/WebScrapping.git
cd WebScrapping

2. Create Virtual Environment

# Windows
python -m venv .venv
.venv\Scripts\activate

# macOS/Linux
python3 -m venv .venv
source .venv/bin/activate

3. Install Dependencies

pip install -r requirements.txt

🎯 Quick Start

Test Face Scraper

python test_face_scraper.py

Download 10 AI Faces (Basic)

python face_scraper.py

Download 50+ Faces (Advanced with Concurrency)

python advanced_face_scraper.py

Basic Web Scraping

python web_scraper.py

Advanced Web Scraping

python advanced_scraper.py

📚 Detailed Usage

🎭 Face Scraping

Basic Face Scraper

from face_scraper import FaceScraper

# Initialize scraper
scraper = FaceScraper(download_folder="my_faces")

# Download single face
scraper.download_single_face("custom_name.jpg")

# Download multiple faces
downloaded_files = scraper.download_multiple_faces(count=20, delay=2)

# Get statistics
stats = scraper.get_download_stats()
print(f"Downloaded {stats['total_files']} images ({stats['total_size_mb']} MB)")
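
The statistics returned by `get_download_stats()` could be computed along the following lines. This is only a sketch: the function name `folder_stats` is hypothetical, and it assumes the stats cover `.jpg` files in a single folder, with the `total_files` and `total_size_mb` keys taken from the usage above.

```python
from pathlib import Path


def folder_stats(folder):
    """Count .jpg files in a folder and sum their sizes in megabytes."""
    files = sorted(Path(folder).glob("*.jpg"))
    total_bytes = sum(f.stat().st_size for f in files)
    return {
        "total_files": len(files),
        "total_size_mb": round(total_bytes / (1024 * 1024), 2),
    }
```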

Advanced Face Scraper

from advanced_face_scraper import AdvancedFaceScraper

# Initialize advanced scraper
scraper = AdvancedFaceScraper("large_collection")

# Download with concurrency
results = scraper.download_batch_concurrent(
    count=100,                    # Number of faces
    batch_name="collection_v1",   # Batch identifier
    max_workers=3,                # Concurrent downloads
    delay=1                       # Delay between batches
)

# Get collection summary
summary = scraper.get_collection_summary()
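
To illustrate what `max_workers` controls, here is a minimal sketch of bounded concurrent downloading with the standard library's `ThreadPoolExecutor`. It assumes a thread pool is what the advanced scraper uses internally; `download_concurrently` and `fetch_one` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_concurrently(fetch_one, count, max_workers=3):
    """Run fetch_one(index) for index 0..count-1 on a bounded thread pool.

    Returns (results, errors), where errors is a list of
    (index, exception) pairs for the calls that raised.
    """
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_one, i): i for i in range(count)}
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception as exc:
                errors.append((futures[future], exc))
    return results, errors
```

Results arrive in completion order, not submission order, so callers that need ordering should sort or key them by index. Keeping `max_workers` small limits how many requests hit the server at once.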

🌐 Web Scraping

Basic Web Scraping

from web_scraper import scrape_basic_info

# Scrape webpage information
info = scrape_basic_info('https://example.com')
print(f"Title: {info['title']}")
print(f"Status: {info['status_code']}")

Advanced Web Scraping

from advanced_scraper import WebScraper

# Initialize scraper
scraper = WebScraper()

# Comprehensive website scraping
results = scraper.scrape_website('https://example.com')

# Save results
scraper.save_to_json(results, 'results.json')
scraper.save_links_to_csv(results['links'], 'links.csv')

βš™οΈ Configuration

Face Scraper Settings

Modify these variables in the scripts:

# Number of faces to download
num_faces = 50

# Delay between downloads (seconds)
delay_seconds = 2

# Concurrent workers (advanced scraper)
max_workers = 3

# Custom folder name
download_folder = "my_custom_folder"

Web Scraper Settings

# Request timeout
timeout = 10

# Custom headers
headers = {
    'User-Agent': 'Your Custom User Agent'
}

# Rate limiting delay
delay = 1

📊 Output Examples

Face Scraper Output Structure

ai_faces_collection/
├── daily_batches/
│   └── collection_20250628_1430/
│       ├── face_20250628_143022_a1b2c3d4.jpg
│       ├── face_20250628_143025_e5f6g7h8.jpg
│       └── ... (more faces)
├── metadata/
│   └── collection_20250628_1430_metadata.json
└── duplicates/
    └── dup_a1b2c3d4.jpg
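
File names in the layout above follow the pattern `face_<date>_<time>_<suffix>.jpg`. The sketch below shows one way to build such a name, assuming the eight-character suffix is the start of the image's MD5 digest. That is a guess based on the duplicate-detection feature, not something the README states, and `face_filename` is a hypothetical helper.

```python
import hashlib
from datetime import datetime


def face_filename(data: bytes, when: datetime) -> str:
    """Build a name like face_20250628_143022_<8 hex chars>.jpg."""
    stamp = when.strftime("%Y%m%d_%H%M%S")
    # Assumption: the suffix is the first 8 hex characters of the MD5 digest.
    short_hash = hashlib.md5(data).hexdigest()[:8]
    return f"face_{stamp}_{short_hash}.jpg"
```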

Web Scraper JSON Output

{
  "url": "https://example.com",
  "timestamp": "2025-06-28 14:30:22",
  "content": {
    "title": "Example Domain",
    "headings": [...],
    "paragraphs": [...],
    "lists": [...]
  },
  "links": [...],
  "images": [...]
}
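
A result of this shape can be assembled from raw HTML as sketched below. The project itself lists BeautifulSoup as its parser; this example deliberately swaps in the standard library's `html.parser` so it runs without extra dependencies. `PageSummary` and `summarize` are hypothetical names, and only the title, headings, and links are collected.

```python
from datetime import datetime
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collects the title, h1-h3 headings, and link targets from HTML."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self.links = []
        self._current = None  # tag whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1", "h2", "h3"):
            self._current = tag
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._current is None:
            return
        if self._current == "title":
            self.title += text
        else:
            self.headings.append(text)


def summarize(url, html):
    """Return a dict with the same top-level keys as the JSON above."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        "content": {"title": parser.title, "headings": parser.headings},
        "links": parser.links,
    }
```

The returned dictionary can be written out with `json.dump` to produce a file like the example.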

🚦 Best Practices & Ethics

🤝 Respectful Scraping

  • ✅ Rate Limiting: Built-in delays prevent server overload
  • ✅ Proper Headers: Realistic User-Agent strings
  • ✅ Error Handling: Graceful failure and retry logic
  • ✅ Legal Compliance: Respect robots.txt and terms of service
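
Respecting robots.txt can be checked programmatically with the standard library's `urllib.robotparser`. The sketch below parses rules from a string so it runs offline; the `build_robots_checker` helper, the `my-bot` agent name, and the sample rules are illustrative assumptions, not part of the toolkit.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text and return a parser that answers can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Sample rules for illustration only.
SAMPLE_RULES = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

checker = build_robots_checker(SAMPLE_RULES)
allowed = checker.can_fetch("my-bot", "https://example.com/public")
blocked = checker.can_fetch("my-bot", "https://example.com/private/page")
delay = checker.crawl_delay("my-bot")  # seconds requested by the site, or None
```

Against a live site, `set_url()` followed by `read()` would fetch the real file instead of the sample string.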

🔒 AI Face Usage

  • ✅ AI-Generated: All faces are AI-created, not real people
  • ✅ Ethical Use: Suitable for research, testing, and development
  • ✅ No Personal Data: No real personal information collected
  • ⚠️ Usage Responsibility: Ensure appropriate use in your projects

πŸ›‘οΈ Troubleshooting

Common Issues

Problem: pip is not recognized when running pip install requests

# Solution: Use full path or activate virtual environment
.venv\Scripts\pip.exe install requests

Problem: Face scraper not downloading

# Solution: Test connectivity
python test_face_scraper.py

Problem: Permission denied errors

# Solution: Check folder permissions or run as administrator

Problem: Import errors

# Solution: Ensure virtual environment is activated
.venv\Scripts\activate  # Windows
source .venv/bin/activate  # macOS/Linux

📦 Dependencies

requests>=2.31.0        # HTTP requests and session management
beautifulsoup4>=4.12.0  # HTML parsing and data extraction
lxml>=4.9.0            # XML/HTML parser backend
Pillow>=10.0.0         # Image processing capabilities

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

⚠️ Disclaimer

This tool is for educational and research purposes. Users are responsible for ensuring compliance with applicable laws and website terms of service. The developers are not responsible for any misuse of this software.

πŸ™ Acknowledgments


Made with ❤️ for the scraping community

About

Implementing the loomis method into a ML model: loominising!
