amafjarkasi/PageScribe

🚀 PageScribe: The Ultimate Content Intelligence Suite 🧠

Transforming the Web into Structured, Actionable Insights.


🌟 Presentation Overview

  1. The Vision 🎯
  2. Core Capabilities ⚡
  3. Advanced Content Intelligence 🧠
  4. Professional Data Crawling 🕷️
  5. Dynamic Interaction 🖱️
  6. Security & Reliability 🔒
  7. Developer Ecosystem 🛠️

1. 🎯 The Vision

PageScribe is not just a scraper; it's a Content Intelligence Engine. In an era of information overload, it distills noisy pages into knowledge: summaries, readability and sentiment metrics, and structured data, all in one click.


2. ⚡ Core Capabilities

  • 📄 Clean Extraction: Powered by Mozilla's Readability to strip ads, banners, and clutter.
  • 💾 Smart Multi-Format Export: Save content as clean Markdown (.md) with automatic YAML Frontmatter or sanitized HTML (.html).
  • 🖼️ Visual Preview: Instant in-extension preview of your extracted content.
  • 📊 Live Statistics: Real-time word count, character count, and estimated reading time.
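
As a rough illustration of the Live Statistics bullet, the sketch below computes word count, character count, and estimated reading time. The `computeStats` helper, the `ContentStats` shape, and the 200 words-per-minute default are assumptions made for this example, not PageScribe's actual code.

```typescript
// Hypothetical shape; PageScribe's real statistics object is not shown in this README.
interface ContentStats {
  words: number;
  characters: number;
  readingMinutes: number;
}

// wordsPerMinute defaults to 200, a common rule of thumb (an assumption here).
function computeStats(text: string, wordsPerMinute = 200): ContentStats {
  const trimmed = text.trim();
  // Splitting on runs of whitespace avoids counting empty "words".
  const words = trimmed === "" ? 0 : trimmed.split(/\s+/).length;
  return {
    words,
    // String length counts UTF-16 code units, so some emoji count as 2.
    characters: text.length,
    // Round up so that any non-empty text reports at least one minute.
    readingMinutes: Math.ceil(words / wordsPerMinute),
  };
}
```

Rounding up is a design choice: a short page shows "1 min" rather than "0 min".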

3. 🧠 Advanced Content Intelligence

  • 📝 Intelligent Summarization: Uses a weighted TF-IDF scoring model with position bias to identify truly salient sentences.
  • 🎭 Sentiment Analysis: Instantly detect the emotional tone (Positive, Neutral, Negative) of any page.
  • 📖 Readability Scoring: Comprehensive Flesch-Kincaid metrics to assess content complexity.
  • 🔑 Scored Keywords: Keywords are ranked by frequency and relevance, not just listed.
  • 🔍 Entity Extraction: Automatically find Emails, URLs, Phone Numbers, and Dates.
  • 🌐 Language Detection: Automatic identification of English, Spanish, French, German, and Italian.
  • 🏷️ Metadata Harvesting: Deep parsing of Open Graph, Twitter Cards, and Schema.org tags.
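
To make the summarization bullet more concrete, here is a minimal sketch of extractive summarization with TF-IDF scoring and a linear position bias. It treats each sentence as a "document" when computing IDF. The function names, the sentence splitter, and the `positionBias` parameter are all assumptions for illustration; PageScribe's real weighting may differ.

```typescript
// Naive tokenizer: lowercase ASCII words and digits only (an assumption).
function tokenize(sentence: string): string[] {
  return sentence.toLowerCase().match(/[a-z0-9']+/g) ?? [];
}

// Naive splitter: breaks on ., ! and ?, so "example.com" would be split too.
function splitSentences(text: string): string[] {
  const parts: string[] = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) || [];
  return parts.map((s) => s.trim()).filter((s) => s.length > 0);
}

function summarize(text: string, maxSentences = 3, positionBias = 0.5): string[] {
  const sentences = splitSentences(text);
  const tokenized = sentences.map(tokenize);
  const n = sentences.length;

  // Document frequency: the number of sentences that contain each term.
  const df = new Map<string, number>();
  tokenized.forEach((tokens) => {
    Array.from(new Set(tokens)).forEach((term) => {
      df.set(term, (df.get(term) ?? 0) + 1);
    });
  });

  const scored = tokenized.map((tokens, index) => {
    const tf = new Map<string, number>();
    tokens.forEach((term) => tf.set(term, (tf.get(term) ?? 0) + 1));

    // Sum of (term frequency * inverse document frequency) over distinct terms.
    let tfidf = 0;
    tf.forEach((count, term) => {
      const idf = Math.log(1 + n / (df.get(term) ?? 1));
      tfidf += (count / tokens.length) * idf;
    });

    // Position bias: the first sentence is multiplied by (1 + positionBias),
    // decaying linearly to a multiplier of 1 for the last sentence.
    const positionWeight = 1 + positionBias * (n > 1 ? 1 - index / (n - 1) : 1);
    return { index, score: tfidf * positionWeight };
  });

  return scored
    .sort((a, b) => b.score - a.score) // highest score first
    .slice(0, maxSentences)
    .sort((a, b) => a.index - b.index) // restore reading order
    .map((s) => sentences[s.index]);
}
```

Returning the selected sentences in their original order keeps the summary readable even though selection is by score.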

4. 🕷️ Professional Data Crawling

  • ⚙️ Advanced Configuration: Control crawl depth, maximum pages, and the delay between requests (rate limiting).
  • 🚫 Smart Filtering: Use Regex Exclusion Patterns to skip unwanted paths.
  • 🤖 Auto-Pilot Mode: Enable Auto-Summarize to summarize each page as it is crawled and build a knowledge base unattended.
  • 📈 Crawl Analytics: Generates aggregate statistics (total words, language distribution, average reading time).
  • 📁 Bulk Export: Download your entire crawl as structured JSON or as flattened CSV for Excel and other data analysis tools.
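
The interaction between crawl depth, maximum pages, and regex exclusion patterns can be sketched as a breadth-first traversal. To stay self-contained, this example walks an in-memory link map instead of fetching pages, so the request delay is left out. `planCrawl`, `CrawlOptions`, and the link map are hypothetical names, not PageScribe's API.

```typescript
// Hypothetical option names; the extension's real settings may be named differently.
interface CrawlOptions {
  maxDepth: number; // 0 means only the start page
  maxPages: number;
  excludePatterns: RegExp[]; // avoid the "g" flag, which makes test() stateful
}

// linkGraph maps a URL to the links found on that page (stands in for fetching).
function planCrawl(
  startUrl: string,
  linkGraph: Record<string, string[]>,
  options: CrawlOptions
): string[] {
  const visited: string[] = [];
  const seen = new Set<string>([startUrl]);
  const queue: Array<{ url: string; depth: number }> = [{ url: startUrl, depth: 0 }];

  while (queue.length > 0 && visited.length < options.maxPages) {
    const { url, depth } = queue.shift()!;
    visited.push(url);

    // Pages at the depth limit are visited, but their links are not followed.
    if (depth >= options.maxDepth) continue;

    for (const link of linkGraph[url] ?? []) {
      const excluded = options.excludePatterns.some((pattern) => pattern.test(link));
      if (!excluded && !seen.has(link)) {
        seen.add(link);
        queue.push({ url: link, depth: depth + 1 });
      }
    }
  }
  return visited;
}
```

Breadth-first order means that when the page limit is reached, pages closest to the start URL have been visited first.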

5. 🖱️ Dynamic Interaction

  • 📜 Auto-Scroll: Automatically scrolls through the page before extraction so that lazy-loaded content is captured.
  • 🔎 History Management: Full-text search and type-based filtering of your past activity.
  • 📤 History Export: Export your entire analysis history to JSON or CSV for external processing.
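
For the CSV side of History Export, the main pitfall is quoting fields that contain commas, quotes, or line breaks. The sketch below follows the usual RFC 4180 conventions. The `HistoryEntry` shape and the function names are assumptions; the real history schema is not documented here.

```typescript
// Hypothetical entry shape for illustration only.
type HistoryEntry = { type: string; url: string; title: string; words: number };

// Quote a field only when needed, doubling any embedded double quotes.
function escapeCsvField(value: string | number): string {
  const text = String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

function toCsv(entries: HistoryEntry[]): string {
  const columns: Array<keyof HistoryEntry> = ["type", "url", "title", "words"];
  const header = columns.join(",");
  const rows = entries.map((entry) =>
    columns.map((column) => escapeCsvField(entry[column])).join(",")
  );
  // RFC 4180 uses CRLF as the record separator.
  return [header, ...rows].join("\r\n");
}
```

This sketch does not guard against spreadsheet formula injection (cells starting with `=`, `+`, `-`, or `@`), which a production exporter may also want to handle.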

6. 🔒 Security & Reliability

  • 🛡️ XSS Shield: All scraped content is sanitized in a detached DOM inside the content script.
  • 🔏 Strict Escaping: Metadata is strictly escaped to prevent injection in exported files.
  • 🧪 Test-Driven: Backed by an extensive suite of Bun-powered unit tests ensuring algorithmic correctness.
  • 🔧 Manifest V3: Fully compliant with the latest Chrome Extension security standards.
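
One way to read the Strict Escaping bullet: metadata values such as page titles are untrusted, so they must not be able to break out of YAML Frontmatter or inject markup into exported HTML. The sketch below shows one possible approach under that assumption. `toFrontmatter` and `escapeHtml` are hypothetical helpers, and the frontmatter keys are assumed to be fixed, trusted names.

```typescript
// Serialize each value as a JSON string. JSON strings are valid YAML
// double-quoted scalars, so quotes, colons, and newlines stay inside the value.
function toFrontmatter(meta: Record<string, string>): string {
  const lines = Object.keys(meta).map((key) => `${key}: ${JSON.stringify(meta[key])}`);
  return ["---", ...lines, "---"].join("\n");
}

// Escape the five characters that are significant in HTML text and attributes.
// The ampersand must be replaced first so later entities are not double-escaped.
function escapeHtml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}
```

For example, a title containing a newline followed by `---` stays on one line as an escaped `\n` instead of ending the frontmatter block early.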

7. 🛠️ Developer Ecosystem

  1. Clone: git clone <repository-url>
  2. Install: bun install
  3. Test: bun test
  4. Build: bun run build
  5. Deploy: Open chrome://extensions, enable Developer mode, and use "Load unpacked" to load the .output/chrome-mv3 directory into Chrome.

PageScribe: Because knowledge is power, but focus is everything.

About

PageScribe is a powerful Chrome extension for web content extraction and site crawling. It allows you to save the main content of any webpage as a clean, readable Markdown file, extract keywords, and crawl an entire website to a specified depth.
