Added language detection through the distant path of AA. #1031
Draft
eephyne wants to merge 4 commits into
- Added an option in the settings to disable this feature.
Contributor
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds an opt-in feature to the Direct Download source that infers a book's language from the "distant path" (file path in search results) when the language metadata is missing, with corresponding settings, parsing logic, and tests.
Changes:
- New `DIRECT_DOWNLOAD_LANGUAGE_FROM_PATH` checkbox setting (default off).
- Logic in `direct_download.py` to extract a distant path from result rows, detect language via bracket/keyed/name/code patterns with alias mapping from `book-languages.json`, and apply local language filtering after parsing.
- Extensive new tests covering detection, false-positive avoidance, legacy behavior, and search-level local filtering.
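The detection order listed above (bracket tags, keyed markers, then language names) can be illustrated with a small sketch. This is not the PR's code: the function name `detect_language_from_path`, the tiny `ALIASES` map, and all three regexes are assumptions made for the example. The PR builds its alias map from `book-languages.json` and also has a bare-code pass that is left out here.

```python
import re

# Hypothetical alias map; the PR derives its map from book-languages.json.
ALIASES = {
    "fr": "fr", "fre": "fr", "french": "fr", "francais": "fr",
    "en": "en", "eng": "en", "english": "en",
    "de": "de", "ger": "de", "german": "de",
}

# Bracket tag such as "[BD FR]" or "[FR]": capture the last 2-3 letter token.
BRACKET = re.compile(r"\[(?:[^\]]*?[\s_-])?([A-Za-z]{2,3})\]")
# Keyed marker such as "lang-fr" or "language: eng".
KEYED = re.compile(r"\blang(?:uage)?\s*[:._-]?\s*([A-Za-z]{2,3})\b", re.IGNORECASE)
# Runs of lowercase letters, used to spot full language names such as "french".
NAME = re.compile(r"[a-z]{3,}")


def detect_language_from_path(path):
    """Return a language code inferred from the path, or None if nothing matches."""
    if not path:
        return None
    # Explicit markers first: they are the least likely to be false positives.
    for pattern in (BRACKET, KEYED):
        for match in pattern.finditer(path):
            code = ALIASES.get(match.group(1).lower())
            if code:
                return code
    # Then full language names; 3-letter tokens are skipped in this pass.
    for token in NAME.findall(path.lower()):
        if len(token) > 3 and token in ALIASES:
            return ALIASES[token]
    return None
```

With this sketch, `detect_language_from_path("lgli/[BD FR] Asterix 01.cbz")` yields `"fr"`, while a path with no recognisable marker yields `None`, which corresponds to the "falls back to unknown" behavior the setting describes.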
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| shelfmark/config/settings.py | Adds new CheckboxField for the path-language toggle. |
| shelfmark/release_sources/direct_download.py | Implements distant-path extraction, language inference, alias map, and conditional local filtering in search_books. |
| tests/config/test_download_settings.py | Verifies the new settings field exists with expected default/description. |
| tests/direct_download/test_search_queries.py | Adds tests for distant-path language detection, edge cases, and the new local filtering path. |
Comment on lines +709 to +722

    detected_from_path = _detect_language_from_distant_path(distant_path)

    # Temporary visual diagnostics for field mapping and path-language inference.
    if _is_language_from_path_enabled():
        logger.info(
            "DD lang debug | id=%s | title=%s | raw_lang=%s | detected_from_path=%s | distant_path=%s | cell11=%s | row=%s",
            record_id,
            _short_debug(title),
            _short_debug(language),
            _short_debug(detected_from_path),
            _short_debug(distant_path, limit=260),
            _short_debug(cells[10].get_text(" ", strip=True), limit=140),
            _short_debug(row.get_text(" ", strip=True), limit=260),
        )
Comment on lines +712 to +713

    if _is_language_from_path_enabled():
        logger.info(

            _short_debug(row.get_text(" ", strip=True), limit=260),
        )

    if _is_language_from_path_enabled() and _is_missing_or_placeholder_language(language):
Comment on lines +290 to +292

    def _extract_distant_path(row: Tag) -> str | None:
        """Extract distant path hints from a direct-download search row."""

    _LANGUAGE_NAME_TOKEN_PATTERN = re.compile(r"[a-z]{3,}(?:-[a-z0-9]+)?")
    _LANGUAGE_ALIAS_TO_CODE: dict[str, str] | None = None
    _LANGUAGE_PLACEHOLDERS = frozenset({"", "-", "--", "—", "unknown", "unk", "n/a", "na"})
    _AMBIGUOUS_SHORT_LANGUAGE_CODES = frozenset({"de", "en", "it", "la", "no", "or", "is", "in"})

        r"\b(?:bd|lang(?:uage)?)\s*[:._-]?\s*([A-Za-z]{2,3})\b",
        re.IGNORECASE,
    )
    _LANGUAGE_NAME_TOKEN_PATTERN = re.compile(r"[a-z]{3,}(?:-[a-z0-9]+)?")
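To show how the `_LANGUAGE_PLACEHOLDERS` and `_AMBIGUOUS_SHORT_LANGUAGE_CODES` constants above might be used, here is a minimal sketch. The two constants are copied from the diff. The function bodies are assumptions, since the page does not show them, and `pick_candidate` is a hypothetical helper illustrating the "prefer non-ambiguous candidates" idea.

```python
_LANGUAGE_PLACEHOLDERS = frozenset({"", "-", "--", "—", "unknown", "unk", "n/a", "na"})
_AMBIGUOUS_SHORT_LANGUAGE_CODES = frozenset({"de", "en", "it", "la", "no", "or", "is", "in"})


def is_missing_or_placeholder_language(value):
    """Treat None, blank strings and placeholder strings as 'no language'."""
    if value is None:
        return True
    return value.strip().lower() in _LANGUAGE_PLACEHOLDERS


def pick_candidate(codes):
    """Prefer the first non-ambiguous code; otherwise fall back to the first code."""
    strong = [code for code in codes if code not in _AMBIGUOUS_SHORT_LANGUAGE_CODES]
    if strong:
        return strong[0]
    return codes[0] if codes else None
```

Codes such as "de", "la" or "no" are also ordinary words in several languages, which is presumably why they are treated as weaker evidence than a code like "fr".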
Comment on lines +576 to +581

    # When path-language inference is enabled, language filtering must happen
    # after row parsing, otherwise source-side lang filters drop rows too early.
    if not (path_language_enabled and requested_langs):
        for value in filters.lang or []:
            if value and value != "all":
                filters_query += f"&lang={quote(value)}"
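The conditional in this snippet can be read as a small pure function. The sketch below is an illustration under assumptions: `build_lang_query` is a hypothetical name, and it models only this revision of the diff (a later commit on this page states that server-side filtering was kept enabled).

```python
from urllib.parse import quote


def build_lang_query(langs, path_language_enabled):
    """Return the '&lang=...' query suffix, or '' when filtering is deferred."""
    requested = [value for value in (langs or []) if value and value != "all"]
    if path_language_enabled and requested:
        # Deferred: rows are filtered locally after parsing instead.
        return ""
    # quote() percent-encodes anything that is not URL-safe.
    return "".join(f"&lang={quote(value)}" for value in requested)
```

For example, `build_lang_query(["fr", "all"], False)` gives `"&lang=fr"`, while the same request with path-language inference enabled gives an empty suffix.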
Comment on lines +463 to +480

    import shelfmark.release_sources.direct_download as dd

    captured_url: dict[str, str] = {}

    original_get = dd.config.get

    def _fake_get(key: str, default=None, user_id=None):
        del user_id
        if key == "DIRECT_DOWNLOAD_LANGUAGE_FROM_PATH":
            return True
        return original_get(key, default)

    monkeypatch.setattr(dd.config, "get", _fake_get)
    monkeypatch.setattr(dd.network, "get_aa_base_url", lambda: "https://mirror.example")
    monkeypatch.setattr(dd.network, "AAMirrorSelector", lambda: object())

    def _fake_html_get_page(url: str, selector, allow_bypasser_fallback=False):
        del selector, allow_bypasser_fallback
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Comment on lines +726 to +729

    # Temporary visual diagnostics for field mapping and path-language inference.
    if _is_language_from_path_enabled():
        logger.info(
            "DD lang debug | id=%s | title=%s | raw_lang=%s | detected_from_path=%s | distant_path=%s | cell11=%s | row=%s",
Comment on lines +224 to +226

    _LANGUAGE_CODE_TOKEN_PATTERN = re.compile(
        r"(?:^|[\s_./\\\-\[(])([A-Za-z]{2,3})(?=$|[\s_./\\\-)\]])"
    )

        r"\b(?:bd|lang(?:uage)?)\s*[:._-]?\s*([A-Za-z]{2,3})\b",
        re.IGNORECASE,
    )
    _LANGUAGE_NAME_TOKEN_PATTERN = re.compile(r"[a-z]{3,}(?:-[a-z0-9]+)?")
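To see what `_LANGUAGE_CODE_TOKEN_PATTERN` actually captures, the sketch below runs the regex from the diff over a sample path. The pattern is copied verbatim. The `code_tokens` helper, the sample path, and the `KNOWN_CODES` set are assumptions for illustration only.

```python
import re

# Copied from the diff: a 2-3 letter token delimited by separators or string ends.
_LANGUAGE_CODE_TOKEN_PATTERN = re.compile(
    r"(?:^|[\s_./\\\-\[(])([A-Za-z]{2,3})(?=$|[\s_./\\\-)\]])"
)

# Hypothetical set of accepted codes; the PR uses an alias map instead.
KNOWN_CODES = {"fr", "en", "de"}


def code_tokens(path):
    """Return every short token the pattern finds, lowercased, in order."""
    return [m.group(1).lower() for m in _LANGUAGE_CODE_TOKEN_PATTERN.finditer(path)]


tokens = code_tokens("comics/[BD FR]/Asterix.cbz")
accepted = [token for token in tokens if token in KNOWN_CODES]
```

On the sample path the raw tokens are `bd`, `fr` and `cbz`, so the file extension is captured too. The pattern alone is permissive, and a lookup against known codes is needed before a token can count as a language.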
Comment on lines +266 to +270

    def _language_alias_to_code() -> dict[str, str]:
        """Build alias->code map from bundled language metadata."""
        global _LANGUAGE_ALIAS_TO_CODE
        if _LANGUAGE_ALIAS_TO_CODE is not None:
            return _LANGUAGE_ALIAS_TO_CODE
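A lazily built module-level cache like the one above can be populated twice if two threads reach it at the same moment. One common way to guard it is double-checked locking, sketched here. The lock name `_ALIAS_LOCK` and the stand-in loader `_load_aliases` are hypothetical; the PR's loader reads `book-languages.json`.

```python
import threading

_LANGUAGE_ALIAS_TO_CODE = None
_ALIAS_LOCK = threading.Lock()  # hypothetical name


def _load_aliases():
    """Stand-in loader for the example; not the PR's JSON-based loader."""
    return {"french": "fr", "fre": "fr", "english": "en"}


def language_alias_to_code():
    """Build the alias map once and reuse it on every later call."""
    global _LANGUAGE_ALIAS_TO_CODE
    # Fast path: no locking once the map exists.
    if _LANGUAGE_ALIAS_TO_CODE is not None:
        return _LANGUAGE_ALIAS_TO_CODE
    with _ALIAS_LOCK:
        # Re-check inside the lock: another thread may have finished first.
        if _LANGUAGE_ALIAS_TO_CODE is None:
            _LANGUAGE_ALIAS_TO_CODE = _load_aliases()
    return _LANGUAGE_ALIAS_TO_CODE
```

Every caller receives the same dictionary object, so the loader runs at most once per process.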
Comment on lines +593 to +596

    if not (path_language_enabled and requested_langs):
        for value in filters.lang or []:
            if value and value != "all":
                filters_query += f"&lang={quote(value)}"
Comment on lines +462 to +465

    def test_search_books_filters_language_locally_when_path_language_enabled(monkeypatch):
        import shelfmark.release_sources.direct_download as dd

        captured_url: dict[str, str] = {}

    )
    _LANGUAGE_NAME_TOKEN_PATTERN = re.compile(r"[a-z]{3,}(?:-[a-z0-9]+)?")
    _LANGUAGE_ALIAS_TO_CODE: dict[str, str] | None = None
    _LANGUAGE_PLACEHOLDERS = frozenset({"", "-", "--", "—", "unknown", "unk", "n/a", "na"})
…eedback
Downgrade temporary per-row language diagnostics from INFO to DEBUG to reduce production log noise.
Cache path-language toggle per parsed row and skip distant-path extraction when the feature is disabled.
Make language alias cache initialization thread-safe with a lock to avoid race conditions on first load.
Reduce language false positives by tightening name-token matching and preferring non-ambiguous strong candidates.
Keep server-side language filtering enabled and apply local path-based filtering as an additional refinement.
Remove redundant normalized placeholder handling for em-dash language values.
Update Direct Download setting description to document language-filter trade-offs.
Fix test indentation consistency and update assertions for restored server-side lang query behavior.
Add regression coverage for bracket-order ambiguity (e.g., EN marker appearing before FR marker).
…accepts books without language metadata and adjusts the language filters for lgli files.
Comment on lines +1443 to +1452

    CheckboxField(
        key="DIRECT_DOWNLOAD_LANGUAGE_FROM_PATH",
        label="Detect Language From Distant Path",
        description=(
            "When language metadata is missing, parse the distant path and set language "
            "from tags like [BD FR]. Falls back to unknown when not detected. "
            "Note: source-side language filters still apply and may exclude poorly tagged rows."
        ),
        default=False,
    ),
Comment on lines +758 to +769

    # Temporary visual diagnostics for field mapping and path-language inference.
    if path_language_enabled:
        logger.debug(
            "DD lang debug | id=%s | title=%s | raw_lang=%s | detected_from_path=%s | distant_path=%s | cell11=%s | row=%s",
            record_id,
            _short_debug(title),
            _short_debug(language),
            _short_debug(detected_from_path),
            _short_debug(distant_path, limit=260),
            _short_debug(cells[10].get_text(" ", strip=True), limit=140),
            _short_debug(row.get_text(" ", strip=True), limit=260),
        )
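The `_short_debug` helper is called above but its body does not appear on this page. A plausible sketch, assuming it flattens a value to one line and truncates it to `limit` characters, could look like this. The default limit of 120, the `"-"` rendering for `None`, and the `"..."` suffix are all guesses.

```python
def short_debug(value, limit=120):
    """Render a value on one line and truncate it for log output."""
    if value is None:
        return "-"
    # Collapse newlines and runs of whitespace so each record stays on one line.
    text = " ".join(str(value).split())
    if len(text) <= limit:
        return text
    # Keep the total length at `limit`, including the ellipsis marker.
    return text[: limit - 3] + "..."
```

Arguments to `logger.debug` are evaluated before the call, so the `get_text` work happens even when debug logging is off. Guarding the block with `logger.isEnabledFor(logging.DEBUG)` is one way to skip that work.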
Comment on lines +774 to +779

    logger.debug(
        "DD lang debug resolved | id=%s | final_lang=%s | fallback=%s",
        record_id,
        _short_debug(language),
        "unknown" if detected_from_path is None else "detected",
    )
Comment on lines +323 to +328

    normalized = re.sub(
        r"\s+\.(epub|mobi|azw3|fb2|djvu|cbz|cbr|pdf|zip|rar|m4b|mp3)\b",
        r".\1",
        normalized,
        flags=re.IGNORECASE,
    )
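The substitution above removes whitespace that ends up in front of a known file extension, presumably introduced when text is extracted from the HTML row. The sketch below wraps the same regex in a hypothetical `rejoin_extension` helper to show its effect on a few inputs.

```python
import re

# Same pattern as in the diff, precompiled for reuse.
_EXTENSION_GAP = re.compile(
    r"\s+\.(epub|mobi|azw3|fb2|djvu|cbz|cbr|pdf|zip|rar|m4b|mp3)\b",
    flags=re.IGNORECASE,
)


def rejoin_extension(text):
    """Remove whitespace that precedes a recognised file extension."""
    return _EXTENSION_GAP.sub(r".\1", text)
```

Unlisted extensions are left alone, and the trailing `\b` keeps a string such as `.pdfx` from being treated as `.pdf`.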
Comment on lines +280 to +286

    mapping: dict[str, str] = {}
    data_path = Path(__file__).resolve().parents[2] / "data" / "book-languages.json"

    try:
        raw = json.loads(data_path.read_text(encoding="utf-8"))
    except (OSError, ValueError, TypeError):
        _LANGUAGE_ALIAS_TO_CODE = {}
Comment on lines +434 to +440

    def _book_matches_requested_languages(book_language: str | None, requested: set[str]) -> bool:
        """Return True when a book language matches normalized requested filters.

        A book whose language is unknown (None) passes through: the server-side
        ``&lang=`` filter already constrained the result set, so dropping rows
        that simply lack metadata would hide relevant results.
        """
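The pass-through rule in this docstring can be sketched as follows. The function body is an assumption, since only the signature and docstring appear on this page, and treating an empty `requested` set as "no filter" is an extra guess.

```python
from typing import Optional, Set


def book_matches_requested_languages(book_language: Optional[str], requested: Set[str]) -> bool:
    """Return True when the book should be kept for the requested languages."""
    if not requested:
        # Assumption: an empty filter set means no filtering at all.
        return True
    if book_language is None:
        # Unknown language passes through, as the docstring describes.
        return True
    return book_language.strip().lower() in requested


# Hypothetical rows: (title, language) pairs after parsing.
rows = [("A", "fr"), ("B", "en"), ("C", None)]
kept = [title for title, lang in rows if book_matches_requested_languages(lang, {"fr"})]
```

Here `kept` holds the French title and the title with no language, while the English one is dropped.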
calibrain added a commit that referenced this pull request Jun 15, 2026
…#1010, #1021, #1025, #1040) (#1066)

## Backport bug fixes from `NemesisHubris/litfinder`

Forwards a curated set of bug fixes from [NemesisHubris/litfinder](https://github.com/NemesisHubris/litfinder), a community fork of this project, that address open issues here. All commits preserve original authorship via `git cherry-pick`; this PR is a backport rather than original work.

Each fix has been reviewed locally, lint/format-cleaned to match this repo's existing ruff config, and verified with the test suite. Rebrand strings, license switches, and features have been deliberately excluded.

### Upstream issues addressed

- **#999**: Mirror URLs with query params no longer break search requests (strip query string/fragment in `normalize_http_url`)
- **#956**: Apprise notifications now respect the configured proxy (proxy env vars injected before dispatch)
- **#1025**: rTorrent: separate `RTORRENT_AUDIOBOOK_LABEL` setting, falls back to book label if unset
- **#1010**: Stop button in Activity no longer makes the panel disappear (snapshot refresh on cancel)
- **#1021**: Anna's Archive slow-download countdown now caps retries instead of looping forever
- **#1040**: Empty destination directory cleaned up when write probe fails
- **PR #1031**: Language detection from Anna's Archive distant path when listing metadata is missing

### Additional fixes (no open issue but clear bugs)

- **fix: Python 2 `except` syntax across 27 files**: `except X, Y:` is a SyntaxError in Python 3 and prevents affected modules from importing at runtime. Mechanical sweep to `except (X, Y):`.
- **fix(abb): info hash validation with magnet fallback**: adds SHA-1/SHA-256 hex validation on extracted info hashes; falls back to scanning the full page for a magnet link (e.g. posted in comments) when the table value is malformed. Also extends the exact-phrase fallback to manual queries and defaults the ABB listing language to `en` when missing, preventing valid results from being hidden by the language filter. Includes a small test-fixture fix (`test(abb): use valid hex info hashes in scraper test fixtures`) since the existing fixtures used non-hex placeholders that the new validation correctly rejects.
- **fix: Anna's Archive title parser**: handles nested edition spans and filters `lgli` catalog descriptor entries (e.g. "Book/Online Audio") that were polluting search results.

### Deliberately not included

- LitFinder rebranding (UI strings, Apprise app ID, logo). The `fix: three upstream bugs` commit (#999/#956/#1025) was cherry-picked with Apprise app-id, description, and logo-URL strings reverted from "LitFinder" back to "Shelfmark"; noted in the commit body.
- Features from the LitFinder fork (multi-variant title search, multi-book flat-folder grouping, fuzzy text matching, "Leave in Place" output handler, admin display name, custom-source plugin system). These are larger behavior changes that each warrant their own focused review; happy to send any of them separately if of interest.
- LitFinder-specific test environment and CI infrastructure.

### Verification

- Backend: **1879 passed**, 96 skipped (1 preexisting failure on `seleniumbase`-dependent test in local venv; runs fine in the standard Docker image with the `browser` extra)
- Lint, format, dead-code: all clean against this repo's existing ruff/vulture config
- One follow-up cleanup commit (`style: ruff lint and format fixes for ported commits`) brings the cherry-picked code into compliance with this repo's ruff settings; no behavior changes there

### Etiquette / credit

Per-commit authorship preserved by cherry-pick. The only edits to the original commits are:

- `fix: three upstream bugs`: Apprise rebrand strings reverted to "Shelfmark" (noted in commit body, original author retained as `Co-Authored-By` via cherry-pick)
- One follow-up `style:` commit for ruff config alignment

Big thanks to [@NemesisHubris](https://github.com/NemesisHubris) for the original work in LitFinder; this PR exists to make sure these fixes reach Shelfmark's wider user base. Happy to revise scope, split into smaller PRs, or split off the Py2 cleanup separately if that's preferable.

---------

Co-authored-by: NemesisHubris <155838970+NemesisHubris@users.noreply.github.com>
Co-authored-by: CaliBrain <calibrain@l4n.xyz>
This avoids Shelfmark discarding many results because of missing language metadata in AA.
With this option enabled, Shelfmark parses the distant path (I don't know its proper name) to look for a language and sets the book's language accordingly.
I tested it with different requests and languages, and the parsing seems OK to me, but it can probably be improved.
The option can be enabled or disabled in the settings under Direct Download > Download Source.