WCXB: Web Content Extraction Benchmark — 2,008 pages, 7 page types, 1,613 domains. The largest open benchmark for web content extraction, boilerplate removal, and main content detection.
Updated Apr 4, 2026 - Python
Python tool for advanced web scraping and site mirroring. It downloads entire websites, including HTML, CSS, JS, images, and other assets, while preserving site structure and updating links for of…
🌐 BeautifulSoup: Effortlessly scrape and parse web data with this powerful Python library! Perfect for developers needing quick and reliable HTML/XML data extraction. Start saving time on your projects today! [MIRROR][UNOFFICIAL]