feat(api): extract PDF hyperlinks and URLs into document metadata#512

Merged
param20h merged 1 commit into param20h:dev from nancysangani:feat/document-metadata-url-extraction on Jun 7, 2026

@nancysangani
Contributor

🔗 Related Issue

Closes #441


📝 What does this PR do?

Adds automated PDF link and URL extraction during document ingestion and
exposes the results in the document metadata API.

backend/app/rag/url_extractor.py (new):

  • extract_urls_from_pdf() — two-pass extraction per page using PyMuPDF
    (already in requirements.txt): link annotations (page.get_links())
    first, then plain-text regex fallback for inline URLs not marked as
    hyperlinks. Deduplicates while preserving first-seen order. Returns []
    safely for non-PDF files or any extraction error.
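
A minimal sketch of the two-pass approach described above, not the code from this PR. The regex, the helper names `find_inline_urls` and `dedupe_preserving_order`, and the trailing-punctuation trimming are assumptions made for illustration; only `page.get_links()` and the "return [] on failure" behaviour come from the description.

```python
import re
from typing import Iterable, List

# Assumed pattern for inline http(s) URLs; the PR's actual regex is not shown.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def dedupe_preserving_order(urls: Iterable[str]) -> List[str]:
    """Drop repeats while keeping first-seen order (dicts preserve insertion order)."""
    return list(dict.fromkeys(urls))


def find_inline_urls(text: str) -> List[str]:
    """Regex fallback for URLs that appear as plain text rather than link annotations."""
    # Trim trailing punctuation that usually belongs to the sentence, not the URL.
    return [match.rstrip(".,;:") for match in URL_PATTERN.findall(text)]


def extract_urls_from_pdf(path: str) -> List[str]:
    """Two passes per page: link annotations first, then the plain-text fallback."""
    if not path.lower().endswith(".pdf"):
        return []
    try:
        import fitz  # PyMuPDF; imported lazily so non-PDF calls never need it

        found: List[str] = []
        with fitz.open(path) as doc:
            for page in doc:
                # Pass 1: hyperlink annotations that carry a "uri" entry.
                for link in page.get_links():
                    uri = link.get("uri")
                    if uri:
                        found.append(uri)
                # Pass 2: inline URLs in the page text.
                found.extend(find_inline_urls(page.get_text()))
        return dedupe_preserving_order(found)
    except Exception:
        # Any extraction error yields an empty result instead of raising.
        return []
```

Collecting annotation links before text matches means that, when the same URL appears in both forms, its position in the result is decided by the annotation pass.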

backend/migrate_add_extracted_urls.py (new):

  • Standalone migration script consistent with the project's existing
    migrate_add_role.py / migrate_add_google_refresh_token.py pattern.
  • Adds extracted_urls TEXT DEFAULT NULL to the documents table.
  • Safe to re-run — skips if the column already exists.
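
The re-runnable behaviour could look roughly like the sketch below. It assumes a SQLite database and uses an in-memory connection with a stand-in `documents` table; the project's real database engine, connection handling, and the function name `add_extracted_urls_column` are not stated in this PR and are hypothetical here.

```python
import sqlite3


def add_extracted_urls_column(conn: sqlite3.Connection) -> bool:
    """Add documents.extracted_urls if it is missing; return True when a change was made."""
    # PRAGMA table_info yields one row per column; index 1 holds the column name.
    columns = [row[1] for row in conn.execute("PRAGMA table_info(documents)")]
    if "extracted_urls" in columns:
        return False  # already migrated, so skip
    conn.execute("ALTER TABLE documents ADD COLUMN extracted_urls TEXT DEFAULT NULL")
    conn.commit()
    return True


# Demonstration against a throwaway in-memory table (stand-in schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, filename TEXT)")
first_run = add_extracted_urls_column(conn)   # adds the column
second_run = add_extracted_urls_column(conn)  # no-op on re-run
```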

backend/app/models.py:

  • Adds extracted_urls = Column(Text, nullable=True) to Document.

backend/app/schemas.py:

  • Adds extracted_urls: Optional[List[str]] = None to DocumentResponse.

backend/app/services/document_ingestion.py:

  • Adds a URL extraction pass after the summary step. Calls
    extract_urls_from_pdf(), JSON-serialises the result, and stores it in
    doc.extracted_urls. Wrapped in try/except — a failure never blocks
    ingestion.
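
As a rough sketch of this step, the extraction can be isolated in a helper that never raises. The helper name `serialize_extracted_urls` and the injected `extractor` argument are hypothetical, and storing `None` for an empty result is an assumption inferred from the testing notes (a PDF with no links yields `null`), not something the diff confirms.

```python
import json
import logging
from typing import Callable, List, Optional

logger = logging.getLogger(__name__)


def serialize_extracted_urls(
    file_path: str, extractor: Callable[[str], List[str]]
) -> Optional[str]:
    """Run the extractor and return a JSON string, or None when empty or on failure."""
    try:
        urls = extractor(file_path)
        # Assumption: an empty list is stored as NULL rather than "[]".
        return json.dumps(urls) if urls else None
    except Exception:
        # Log and continue so a failed extraction never blocks ingestion.
        logger.exception("URL extraction failed for %s", file_path)
        return None
```

In the ingestion service the return value would then be assigned to `doc.extracted_urls` before the document is committed.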

backend/app/routes/documents.py:

  • Adds _deserialize_doc() helper that parses the JSON string from the DB
    back into List[str] before returning DocumentResponse. Applied to
    get_document, list_documents, and rename_document.
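
The parsing half of such a helper might be sketched as follows. `parse_extracted_urls` is a hypothetical name, and the tolerance for malformed or non-list JSON is an assumption; the PR's `_deserialize_doc()` works on the whole document object and may behave differently.

```python
import json
from typing import List, Optional


def parse_extracted_urls(raw: Optional[str]) -> Optional[List[str]]:
    """Turn the stored JSON string back into a list; None for missing or malformed data."""
    if not raw:
        return None
    try:
        value = json.loads(raw)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a ValueError subclass.
        return None
    if not isinstance(value, list):
        return None
    # Keep only string entries so the result matches Optional[List[str]].
    return [item for item in value if isinstance(item, str)]
```

Returning `None` instead of raising keeps the three routes usable even if a row holds unexpected data in `extracted_urls`.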

🗂️ Type of Change

  • ✨ New feature

🧪 How was this tested?

  • Ran python migrate_add_extracted_urls.py — column added successfully
  • Ran the backend locally (uvicorn app.main:app --reload)
  • Uploaded a PDF containing hyperlinks; confirmed extracted_urls field
    in GET /api/v1/documents/{id} response contains correct URLs
  • Uploaded a PDF with no links; confirmed extracted_urls is null
  • Uploaded a non-PDF (DOCX/TXT); confirmed extraction is skipped and
    extracted_urls is null
  • Confirmed existing text/table/image ingestion is unaffected

✅ Self-Review Checklist

  • My branch is based on dev, not main
  • I have not added any secrets / API keys
  • I have not modified main branch or any HuggingFace deployment config
  • My code follows the existing style (no unnecessary formatting changes)
  • I have updated relevant docs / comments if needed

@param20h param20h merged commit df076b8 into param20h:dev Jun 7, 2026
7 checks passed
@github-actions (bot) added the following labels on Jun 7, 2026:

  • gssoc (GirlScript Summer of Code 2026 issue/PR)
  • gssoc:approved (Approved for GSSoC base points, +50 pts)
  • level:intermediate (+35 pts)
  • mentor:param20h (Mentor for this PR)
  • type:backend (Backend API)

github-actions Bot commented Jun 7, 2026

🎉 Congratulations on getting your Pull Request merged! 🎉

Thank you for contributing to PDF-Assistant-RAG as part of GSSoC '26! 🚀

Keep up the great work! ✨

Development

Successfully merging this pull request may close these issues.

feat(api): Implement automated PDF link and URL extraction in document metadata

2 participants