web-content-extraction

Here are 3 public repositories matching this topic...

Murrough-Foley / web-content-extraction-benchmark

WCXB: Web Content Extraction Benchmark — 2,008 pages, 7 page types, 1,613 domains. The largest open benchmark for web content extraction, boilerplate removal, and main content detection.

nlp benchmark text-extraction dataset web-scraping html-parsing content-extraction boilerplate-removal web-content-extraction

Updated Apr 4, 2026
Python

ONEdart / SiteMirror

Python tool for advanced web scraping and site mirroring. It downloads entire websites, including HTML, CSS, JS, images, and other assets, while preserving site structure and updating links for of…

web-scraping web-archiving html-scraper python-automation website-cloning static-site-mirror web-content-extraction

Updated May 22, 2026
Python

🌐 BeautifulSoup: Effortlessly scrape and parse web data with this powerful Python library! Perfect for developers needing quick and reliable HTML/XML data extraction. Start saving time on your projects today! [MIRROR][UNOFFICIAL]

python data-mining mirror web-crawler python3 unofficial web-scraping xpath data-extraction html-parsing css-selectors web-automation mirrored-repository unofficial-mirror dynamic-web-scraping api-scraping web-content-extraction

Updated Mar 1, 2026
Python
