feat(api): extract PDF hyperlinks and URLs into document metadata by nancysangani · Pull Request #512 · param20h/PDF-Assistant-RAG

nancysangani · 2026-06-07T12:02:39Z

🔗 Related Issue

Closes #441

📝 What does this PR do?

Adds automated PDF link and URL extraction during document ingestion and
exposes the results in the document metadata API.

backend/app/rag/url_extractor.py (new):

extract_urls_from_pdf(): two-pass extraction per page using PyMuPDF
(already in requirements.txt). The first pass reads link annotations
via page.get_links(); the second is a plain-text regex fallback that
catches inline URLs not marked as hyperlinks. Results are deduplicated
while preserving first-seen order. Returns [] safely for non-PDF files
or on any extraction error.
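For readers who have not opened the diff, the two-pass approach could look roughly like the sketch below. It is not the code from this PR: the regex, the trailing-punctuation stripping, and the helper name dedupe_preserving_order are assumptions made for illustration. Only fitz.open(), page.get_links(), and page.get_text() are taken from PyMuPDF itself.

```python
import re
from typing import Iterable, List

# Hypothetical pattern; the regex actually used in url_extractor.py is not shown in the PR text.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")


def dedupe_preserving_order(urls: Iterable[str]) -> List[str]:
    """Drop repeats while keeping the first-seen order."""
    seen = set()
    ordered = []
    for url in urls:
        # Assumption: trailing sentence punctuation is not part of the URL.
        cleaned = url.rstrip(".,;:")
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            ordered.append(cleaned)
    return ordered


def extract_urls_from_pdf(path: str) -> List[str]:
    """Return the URLs found in a PDF, or [] for non-PDF input or any error."""
    if not path.lower().endswith(".pdf"):
        return []
    try:
        import fitz  # PyMuPDF; imported lazily so a missing install also yields []

        found = []
        with fitz.open(path) as doc:
            for page in doc:
                # Pass 1: link annotations that carry an external URI.
                for link in page.get_links():
                    uri = link.get("uri")
                    if uri:
                        found.append(uri)
                # Pass 2: regex fallback over the page's plain text.
                found.extend(URL_RE.findall(page.get_text()))
        return dedupe_preserving_order(found)
    except Exception:
        return []
```

Collecting annotation links before regex matches on each page means that, when the same URL appears in both forms on a page, the annotation's spelling is the one that survives deduplication.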

backend/migrate_add_extracted_urls.py (new):

Standalone migration script consistent with the project's existing
migrate_add_role.py / migrate_add_google_refresh_token.py pattern.
Adds extracted_urls TEXT DEFAULT NULL to the documents table.
Safe to re-run: the script skips the change if the column already exists.
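A minimal sketch of an idempotent column migration, assuming a SQLite database accessed through the standard sqlite3 module. The PR text does not say which database or driver the script targets, and the function name add_extracted_urls_column is hypothetical.

```python
import sqlite3


def add_extracted_urls_column(conn: sqlite3.Connection) -> bool:
    """Add documents.extracted_urls if missing; return True when a change was made."""
    # PRAGMA table_info yields one row per column; index 1 holds the column name.
    existing = [row[1] for row in conn.execute("PRAGMA table_info(documents)")]
    if "extracted_urls" in existing:
        return False  # already migrated, nothing to do
    conn.execute("ALTER TABLE documents ADD COLUMN extracted_urls TEXT DEFAULT NULL")
    conn.commit()
    return True
```

Checking the column list first is what makes a second run a no-op instead of a "duplicate column" error.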

backend/app/models.py:

Adds extracted_urls = Column(Text, nullable=True) to Document.

backend/app/schemas.py:

Adds extracted_urls: Optional[List[str]] = None to DocumentResponse.

backend/app/services/document_ingestion.py:

Adds a URL extraction pass after the summary step. Calls
extract_urls_from_pdf(), JSON-serialises the result, and stores it in
doc.extracted_urls. The pass is wrapped in try/except, so an extraction
failure never blocks ingestion.
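The ingestion step described above can be sketched as follows. The function name run_url_extraction_pass and the injected extractor argument are hypothetical. Storing None for an empty result is inferred from the testing notes (a PDF with no links yields null), not confirmed from the diff.

```python
import json
import logging
from typing import Any, Callable, List

logger = logging.getLogger(__name__)


def run_url_extraction_pass(
    doc: Any, file_path: str, extractor: Callable[[str], List[str]]
) -> None:
    """Store extracted URLs on doc as a JSON string; never raise to the caller."""
    try:
        urls = extractor(file_path)
        # Assumption: an empty result is stored as NULL rather than "[]".
        doc.extracted_urls = json.dumps(urls) if urls else None
    except Exception:
        # Extraction is best-effort: log the failure and let ingestion continue.
        logger.exception("URL extraction failed for %s", file_path)
        doc.extracted_urls = None
```

Passing the extractor in as an argument is only a convenience for exercising the sketch in isolation; the PR calls extract_urls_from_pdf() directly.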

backend/app/routes/documents.py:

Adds _deserialize_doc() helper that parses the JSON string from the DB
back into List[str] before returning DocumentResponse. Applied to
get_document, list_documents, and rename_document.
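The parsing half of such a helper might look like the sketch below. The name parse_extracted_urls is hypothetical, and the real _deserialize_doc() goes on to build a DocumentResponse, which is omitted here. Treating malformed or non-list JSON as None is an assumption about the desired fallback.

```python
import json
from typing import List, Optional


def parse_extracted_urls(raw: Optional[str]) -> Optional[List[str]]:
    """Turn the JSON string stored in the DB back into a list of URLs."""
    if not raw:
        return None  # NULL column or empty string: no URLs recorded
    try:
        value = json.loads(raw)
    except (TypeError, ValueError):
        return None  # assumption: corrupt data should not break the response
    if not isinstance(value, list):
        return None
    return [str(item) for item in value]
```

Doing the conversion in one place keeps the three routes consistent, since each of them returns the same response schema.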

🗂️ Type of Change

✨ New feature

🧪 How was this tested?

Ran python migrate_add_extracted_urls.py; column added successfully
Ran the backend locally (uvicorn app.main:app --reload)
Uploaded a PDF containing hyperlinks; confirmed the extracted_urls field in the GET /api/v1/documents/{id} response contains the correct URLs
Uploaded a PDF with no links; confirmed extracted_urls is null
Uploaded a non-PDF (DOCX/TXT); confirmed extraction is skipped and extracted_urls is null
Confirmed existing text/table/image ingestion is unaffected

✅ Self-Review Checklist

My branch is based on dev, not main
I have not added any secrets / API keys
I have not modified main branch or any HuggingFace deployment config
My code follows the existing style (no unnecessary formatting changes)
I have updated relevant docs / comments if needed

github-actions · 2026-06-07T17:24:40Z

🎉 Congratulations on getting your Pull Request merged! 🎉

Thank you for contributing to PDF-Assistant-RAG as part of GSSoC '26! 🚀

Keep up the great work! ✨

feat(api): extract PDF hyperlinks and URLs into document metadata

3d05b08

nancysangani requested a review from param20h as a code owner June 7, 2026 12:02

nancysangani mentioned this pull request Jun 7, 2026

feat(api): Implement automated PDF link and URL extraction in document metadata #441

Closed

param20h approved these changes Jun 7, 2026

param20h merged commit df076b8 into param20h:dev Jun 7, 2026
7 checks passed

github-actions Bot added labels Jun 7, 2026: gssoc (GirlScript Summer of Code 2026 issue/PR), gssoc:approved (Approved for GSSoC base points, +50 pts), level:intermediate (+35 pts), mentor:param20h (Mentor for this PR), type:backend (Backend API)

param20h merged 1 commit into param20h:dev from nancysangani:feat/document-metadata-url-extraction
