Transforming the Web into Structured, Actionable Insights.
- The Vision 🎯
- Core Capabilities ⚡
- Advanced Content Intelligence 🧠
- Professional Data Crawling 🕷️
- Dynamic Interaction 🖱️
- Security & Reliability 🔒
- Developer Ecosystem 🛠️
PageScribe is not just a scraper; it's a Content Intelligence Engine. In an era of information overload, PageScribe helps you distill noise into knowledge, providing high-quality summaries, deep metrics, and structured data at the click of a button.
- 📄 Clean Extraction: Powered by Mozilla's Readability to strip ads, banners, and clutter.
- 💾 Smart Multi-Format Export: Save content as clean Markdown (.md) with automatic YAML front matter, or as sanitized HTML (.html).
- 🖼️ Visual Preview: Instant in-extension preview of your extracted content.
- 📊 Live Statistics: Real-time word count, character count, and estimated reading time.
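
As a rough illustration of the live statistics above, here is a minimal sketch. The `computeStats` helper, the `TextStats` shape, and the 200 words-per-minute reading speed are assumptions made for this example, not PageScribe's actual internals.

```typescript
// Hypothetical helper; the extension's real implementation may differ.
interface TextStats {
  words: number;
  characters: number;
  readingMinutes: number;
}

// Assumed average reading speed (a common rule of thumb).
const WORDS_PER_MINUTE = 200;

function computeStats(text: string, wpm: number = WORDS_PER_MINUTE): TextStats {
  const trimmed = text.trim();
  // Split on runs of whitespace; empty input has zero words.
  const words = trimmed === "" ? 0 : trimmed.split(/\s+/).length;
  return {
    words: words,
    // Counts UTF-16 code units, including whitespace.
    characters: text.length,
    // Round up so any non-empty text reports at least one minute.
    readingMinutes: Math.ceil(words / wpm),
  };
}
```

Calling `computeStats(extractedText)` on each content update would be enough to keep the three counters current.
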
- 📝 Intelligent Summarization: Uses a weighted TF-IDF scoring model with position bias to identify truly salient sentences.
- 🎭 Sentiment Analysis: Instantly detect the emotional tone (Positive, Neutral, Negative) of any page.
- 📖 Readability Scoring: Comprehensive Flesch-Kincaid metrics to assess content complexity.
- 🔑 Scored Keywords: Keywords are ranked by frequency and relevance, not just listed.
- 🔍 Entity Extraction: Automatically find Emails, URLs, Phone Numbers, and Dates.
- 🌐 Language Detection: Automatic identification of English, Spanish, French, German, and Italian.
- 🏷️ Metadata Harvesting: Deep parsing of Open Graph, Twitter Cards, and Schema.org tags.
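
To make the summarization idea more concrete, the sketch below scores each sentence by TF-IDF (treating every sentence as its own document) and then boosts earlier sentences with a linear position weight. The function names, the tokenizer, and the default `positionBias` of 0.5 are all assumptions for illustration; the weighting PageScribe actually uses may differ.

```typescript
// Illustrative sketch only; names and weights are assumptions.

function splitSentences(text: string): string[] {
  // A sentence is a run of non-terminators followed by optional terminators.
  const parts: string[] = text.match(/[^.!?]+[.!?]*/g) || [];
  return parts.map((s) => s.trim()).filter((s) => s.length > 0);
}

function tokenize(sentence: string): string[] {
  const tokens: string[] = sentence.toLowerCase().match(/[a-z0-9']+/g) || [];
  return tokens;
}

function summarize(text: string, maxSentences: number = 3, positionBias: number = 0.5): string[] {
  const sentences = splitSentences(text);
  const tokenized = sentences.map(tokenize);
  const n = sentences.length;

  // Document frequency: how many sentences contain each term.
  const df: Record<string, number> = Object.create(null);
  for (const tokens of tokenized) {
    const unique = tokens.filter((t, i) => tokens.indexOf(t) === i);
    for (const term of unique) {
      df[term] = (df[term] || 0) + 1;
    }
  }

  const scored = tokenized.map((tokens, index) => {
    let score = 0;
    for (const term of tokens) {
      // Each occurrence adds idf / length, so a term's total is tf * idf.
      score += Math.log(n / df[term]) / tokens.length;
    }
    // Position bias: weight falls linearly from (1 + bias) toward 1.
    const positionWeight = 1 + positionBias * (1 - index / n);
    return { index: index, score: score * positionWeight };
  });

  return scored
    .slice()
    .sort((a, b) => b.score - a.score || a.index - b.index) // best first
    .slice(0, maxSentences)
    .sort((a, b) => a.index - b.index) // restore reading order
    .map((s) => sentences[s.index]);
}
```

Returning the selected sentences in their original order keeps the summary readable, while the bias parameter controls how strongly lead sentences are favored.
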
- ⚙️ Advanced Configuration: Control crawl depth, maximum pages, and rate limiting (delay between requests).
- 🚫 Smart Filtering: Use Regex Exclusion Patterns to skip unwanted paths.
- 🤖 Auto-Pilot Mode: Enable Auto-Summarize to build a knowledge base while you sleep.
- 📈 Crawl Analytics: Generates aggregate statistics (total words, language distribution, average reading time).
- 📁 Bulk Export: Download your entire crawl as structured JSON or as a flattened CSV for Excel and other data analysis tools.
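
The crawl limits and exclusion patterns described above could combine along these lines. `CrawlConfig`, its field names, and `shouldVisit` are hypothetical, chosen for this sketch rather than taken from the extension's settings schema.

```typescript
// Hypothetical configuration shape; field names are assumptions.
interface CrawlConfig {
  maxDepth: number;          // how many links away from the start page
  maxPages: number;          // hard cap on pages per crawl
  delayMs: number;           // pause between requests (rate limit)
  excludePatterns: string[]; // regex sources matched against the URL
}

function shouldVisit(
  url: string,
  depth: number,
  visitedCount: number,
  config: CrawlConfig
): boolean {
  if (depth > config.maxDepth) return false;
  if (visitedCount >= config.maxPages) return false;
  // Skip the URL if any exclusion pattern matches.
  // Note: an invalid pattern makes `new RegExp` throw.
  return !config.excludePatterns.some((p) => new RegExp(p).test(url));
}

const exampleConfig: CrawlConfig = {
  maxDepth: 2,
  maxPages: 10,
  delayMs: 500,
  excludePatterns: ["/login", "\\.pdf$"],
};
```

With `exampleConfig`, a URL ending in `.pdf` or containing `/login` is skipped, as is anything deeper than two links from the start page.
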
- 📜 Auto-Scroll: Automatically scrolls through lazy-loaded content before extraction to ensure no data is missed.
- 🔎 History Management: Full-text search and type-based filtering of your past activity.
- 📤 History Export: Export your entire analysis history to JSON or CSV for external processing.
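
For the CSV side of history export, a minimal sketch might look like the following. The `HistoryEntry` fields are invented for illustration, and the quoting follows the usual RFC 4180 convention of doubling embedded quotes.

```typescript
// Hypothetical history record; the real schema may have other fields.
interface HistoryEntry {
  url: string;
  title: string;
  type: string;
  wordCount: number;
}

function escapeCsvField(value: string | number): string {
  const text = String(value);
  // Quote fields that contain a comma, quote, or line break,
  // and double any embedded quotes.
  return /[",\r\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
}

function toCsv(entries: HistoryEntry[]): string {
  const header = ["url", "title", "type", "wordCount"];
  const rows = entries.map((e) =>
    [e.url, e.title, e.type, e.wordCount].map((v) => escapeCsvField(v)).join(",")
  );
  return [header.join(",")].concat(rows).join("\r\n");
}
```

Quoting only when needed keeps simple rows easy to read while still handling titles that contain commas or quotation marks.
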
- 🛡️ XSS Shield: Every byte of scraped content is sanitized through a detached DOM engine in the content script.
- 🔏 Strict Escaping: Metadata is strictly escaped to prevent injection in exported files.
- 🧪 Test-Driven: Backed by an extensive suite of Bun-powered unit tests ensuring algorithmic correctness.
- 🔧 Manifest V3: Fully compliant with the latest Chrome Extension security standards.
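
As an illustration of the strict escaping mentioned above, metadata values can be escaped before they are written into an exported HTML file. The `escapeHtml` helper is a generic sketch under that assumption, not the project's actual function.

```typescript
// Generic HTML escaping sketch; not PageScribe's actual code.
function escapeHtml(value: string): string {
  return value
    .replace(/&/g, "&amp;") // must run first so later entities are not re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}
```

Escaping both quote characters matters when the value ends up inside an attribute, such as a `content="..."` value on a meta tag.
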
- Clone: `git clone <repository-url>`
- Install: `bun install`
- Test: `bun test`
- Build: `bun run build`
- Deploy: Load the `.output/chrome-mv3` directory into Chrome.
PageScribe: Because knowledge is power, but focus is everything. ✨