
[pull] dev from ArchiveBox:dev #1

Open
pull[bot] wants to merge 4557 commits into mrbenns:dev from ArchiveBox:dev

Conversation


pull[bot] commented May 21, 2022

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

pull[bot] added the "⤵️ pull" and "merge-conflict" (Resolve conflicts manually) labels May 21, 2022
Comment thread archivebox/main.py Outdated
'youtube_dl',
], capture_output=True, text=True, cwd=out_dir).stdout.split('Location: ')[-1].split('\n', 1)[0]
NEW_YOUTUBEDL_BINARY = Path(pkg_path) / 'youtube_dl' / '__main__.py'
os.chmod(NEW_YOUTUBEDL_BINARY, 0o777)

Check failure

Code scanning / CodeQL

Overly permissive file permissions

Overly permissive mask in chmod sets file to world writable.
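
One way to address a finding like this is to add only the execute bits the binary needs and make sure the file is not group- or world-writable. The following is a minimal sketch under that assumption; `make_executable` and `demo` are hypothetical names for illustration and are not part of ArchiveBox.

```python
import os
import stat
import tempfile
from pathlib import Path


def make_executable(path: Path) -> int:
    """Make `path` executable while ensuring it is not group- or world-writable.

    Hypothetical helper for illustration; not part of ArchiveBox.
    Returns the mode that was applied.
    """
    current = stat.S_IMODE(path.stat().st_mode)
    # Clear group/world write bits, then add execute for user, group, and other.
    new_mode = (current & ~(stat.S_IWGRP | stat.S_IWOTH)) | (
        stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    )
    os.chmod(path, new_mode)
    return new_mode


def demo() -> int:
    """Create a throwaway script with mode 0o644 and make it executable."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "__main__.py"
        script.write_text("print('hello')\n")
        os.chmod(script, 0o644)
        return make_executable(script)
```

Starting from `0o644`, this yields `0o755` rather than `0o777`, so other users can run the file but cannot modify it.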
Comment thread archivebox/core/views.py Outdated
if PUBLIC_INDEX:
    return redirect('/public')

return redirect(f'/admin/login/?next={request.path}')

Check warning

Code scanning / CodeQL

URL redirection from remote source

Untrusted URL redirection depends on [a user-provided value](1).
Comment thread archivebox/core/views.py Outdated

def get(self, request, path):
    if not request.user.is_authenticated and not PUBLIC_SNAPSHOTS:
        return redirect(f'/admin/login/?next={request.path}')

Check warning

Code scanning / CodeQL

URL redirection from remote source

Untrusted URL redirection depends on [a user-provided value](1).
Comment thread archivebox/core/views.py Outdated

# missing trailing slash -> redirect to index
if '/' not in path:
    return redirect(f'{path}/index.html')

Check warning

Code scanning / CodeQL

URL redirection from remote source

Untrusted URL redirection depends on [a user-provided value](1).
Comment thread archivebox/core/views.py Outdated
    response = super().get(*args, **kwargs)
    return response
else:
    return redirect(f'/admin/login/?next={self.request.path}')

Check warning

Code scanning / CodeQL

URL redirection from remote source

Untrusted URL redirection depends on [a user-provided value](1).
Comment thread archivebox/core/admin.py Outdated

def add_view(self, request):
    if not request.user.is_authenticated:
        return redirect(f'/admin/login/?next={request.path}')

Check warning

Code scanning / CodeQL

URL redirection from remote source

Untrusted URL redirection depends on [a user-provided value](1).
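
Most of the redirect findings above share one pattern: the request path is interpolated directly into the redirect target. A minimal standard-library sketch of one mitigation follows, assuming the goal is to accept only same-site paths and to percent-encode the `next` value. `is_local_path` and `login_redirect_url` are hypothetical helpers, not ArchiveBox code; in a Django project, `django.utils.http.url_has_allowed_host_and_scheme` is the usual tool for the validation step.

```python
from urllib.parse import urlencode, urlsplit


def is_local_path(path: str) -> bool:
    """Return True only for same-site absolute paths such as '/archive/123/'.

    Hypothetical helper for illustration; not part of ArchiveBox.
    """
    # Must start with a single slash; '//host/...' is a scheme-relative URL.
    if not path.startswith("/") or path.startswith("//"):
        return False
    # Some browsers treat backslashes like forward slashes.
    if "\\" in path:
        return False
    parts = urlsplit(path)
    return parts.scheme == "" and parts.netloc == ""


def login_redirect_url(requested_path: str, login_url: str = "/admin/login/") -> str:
    """Build a login URL whose `next` parameter is validated and percent-encoded."""
    next_path = requested_path if is_local_path(requested_path) else "/"
    return f"{login_url}?{urlencode({'next': next_path})}"
```

Encoding the value with `urlencode` keeps characters such as `?` and `&` in the original path from being read as extra query parameters on the login URL.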
pirate and others added 24 commits May 31, 2026 02:24
…l limits, redirect to abx-plugins)

This rewrite (now reapplied on top of the wiki subtree) covers the full
session's work on Configuration.md:

- Add crawl/snapshot limits (CRAWL_MAX_URLS/SIZE/TIMEOUT,
  CRAWL_MAX_CONCURRENT_SNAPSHOTS, SNAPSHOT_MAX_SIZE), DELETE_AFTER,
  PERMISSIONS, PLUGINS/ENABLED_PLUGINS/ACTIVE_PERSONA.
- Add new Database Settings section (SQLITE_* tuning + DATABASE_NAME).
- Add SERVER_SECURITY_MODE deep-dive (4 modes, host-layout table).
- Add Storage path overrides (DATA_DIR, ARCHIVE_DIR, USERS_DIR,
  PERSONAS_DIR, CRAWL_DIR, SNAP_DIR, ALLOW_NO_UNIX_SOCKETS).
- Remove ALLOWED_HOSTS + CSRF_TRUSTED_ORIGINS as user-settable; both
  auto-derived from BASE_URL + SERVER_SECURITY_MODE. Backward-compat
  anchors preserved on BASE_URL with the 0.7.3 -> 0.9 legacy upgrade note.
- Remove the entire Plugin Settings tree (~200 options, 41 subsections);
  replace with prominent redirect to https://archivebox.github.io/abx-plugins/
  and a "shared core options that plugins fall back to" table.
- Add 231 backward-compat <a id="..."></a> anchors so old URLs to plugin
  sections / removed options / multi-option headers all still resolve
  (e.g. #wget_args -> Plugin Configuration section, #public_snapshots ->
  PERMISSIONS, #ssl_enabled -> Plugin Configuration, #admin_username ->
  ADMIN_USERNAME/PASSWORD heading, #dir_output_permissions ->
  OUTPUT_PERMISSIONS, #url_blacklist -> URL_DENYLIST).
- Fix wrong default: PUBLIC_ADD_VIEW is False, not True.
- Drop the 7 TRAFILATURA_OUTPUT_* per-format flags (replaced by single
  TRAFILATURA_OUTPUT_FORMATS in plugin); SSL_ENABLED/SSL_TIMEOUT (wrong
  plugin namespace) — anchors redirected to Plugin Configuration.
- Reframe COOKIES_FILE as low-level escape hatch; personas are the
  preferred auth path.
- Link every named plugin to its specific anchor on the abx-plugins page
  (e.g. WGET_TIMEOUT -> #wget, SONIC_HOST -> #search_backend_sonic).
- Strip implementation-detail mentions (Pydantic, etc.).
- Slim Shell Options to only user-settable (DEBUG, USE_COLOR,
  SHOW_PROGRESS); drop IS_TTY/IN_DOCKER/IN_QEMU.
- Restructure: General -> Server (+LDAP) -> Storage -> Database (new) ->
  Search -> Shell -> Plugin Configuration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ugin selector

ENABLED_PLUGINS and PLUGINS were two near-identical config keys: PLUGINS
was the CLI/per-run whitelist (--plugins flag, runner config), while
ENABLED_PLUGINS was the UI/API "persisted enabled set" — but both ended
up steering the same plugin resolution. Consolidating on PLUGINS as the
single source of truth.

- archivebox/config/common.py: drop the ENABLED_PLUGINS Field entirely
  (no alias, no compat shim — the migration is one-shot).
- archivebox/hooks.py:get_enabled_plugins(): read PLUGINS instead of
  ENABLED_PLUGINS. Function name kept (describes the return value).
- archivebox/templates/core/add.html: admin "Add" form JS now writes to
  PLUGINS; help text updated to reference PLUGINS.

views.py:1302 and runner.py:585 already read/wrote PLUGINS; they're now
consistent with the resolver.

abx-dl is unaffected — it receives selected_plugins as a Python list
argument and never reads either config key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Several inaccuracies + over-documentation cleaned up in one pass:

- ONLY_NEW: completely rewrite. The old prose ("ArchiveBox will never
  re-download sites that have already succeeded previously") was carried
  over from 0.7.x and is wrong in 0.9.x — setting ONLY_NEW=False (or
  --no-only-new) explicitly creates a new Snapshot and re-runs every
  extractor. Now describes the actual behavior: skip URL entirely vs.
  create a new Snapshot for it.
- CRAWL_MAX_CONCURRENT_SNAPSHOTS: fix the "each concurrent Snapshot
  launches its own Chrome instance" claim. Chrome is crawl-scoped by
  default (CHROME_ISOLATION="crawl") — concurrent Snapshots share the
  crawl's Chrome via tabs, not separate browser processes.
- BASE_URL: drop the "admin.admin.admin.<host> compounding bug"
  reference. Config docs shouldn't explain legacy bugs.
- Remove derived/runtime-only options that are NOT user-settable:
  ACTIVE_PERSONA (set by persona resolver), CRAWL_DIR/SNAP_DIR (injected
  by orchestrator per-call), DATA_DIR (derived from cwd), ARCHIVE_DIR
  (derived from DATA_DIR/archive), USERS_DIR (derived from ARCHIVE_DIR),
  PERSONAS_DIR (derived from DATA_DIR), LIB_BIN_DIR (tracks LIB_DIR),
  DATABASE_NAME (derived from DATA_DIR/index.sqlite3). Backward-compat
  <a id="..."></a> anchors preserved for all of them above the nearest
  surviving heading so external links still resolve.
- LIB_DIR: fix default path. The doc claimed "<DATA_DIR>/lib/<arch>-<os>"
  but constants.py:117 uses platformdirs.user_config_path("abx") / "lib"
  — the XDG user-config dir, not inside the data folder. Updated to the
  actual default.
- ENABLED_PLUGINS section dropped (option removed in a separate commit);
  anchor redirected to PLUGINS.
- Drop the "Pydantic config" implementation-detail mention in PUID/PGID.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…p default to 50

admin_snapshots.py:571 had min(max(50, SNAPSHOTS_PER_PAGE), 500), and
admin_archiveresults.py:501 had min(max(5, SNAPSHOTS_PER_PAGE), 5000).
Both clamps silently overrode the configured value — a documented
default of 40 was inaccessible in the Snapshot admin, and the
ArchiveResult admin also reused the same setting without being mentioned
in the docs.

- Drop both clamps; admin changelists now use SNAPSHOTS_PER_PAGE as-is.
- Bump the default in common.py from 40 to 50 (matches what users were
  actually seeing in the admin under the old floor).
- Add ge=1 validation so non-positive values are rejected at config
  parse time instead of producing broken pagination.
- Update Configuration.md: new default 50, clarify the option drives
  both Snapshot and ArchiveResult admin changelists plus the public
  index, and that it must be >= 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
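
To make the effect of the removed clamps concrete, the sketch below reproduces the two expressions quoted in the commit message and shows why a configured value of 40 never reached the Snapshot admin. The function names are illustrative only, and `validate_page_size` is a plain-Python stand-in for the `ge=1` check, not the actual config code.

```python
def old_snapshot_admin_page_size(configured: int) -> int:
    """Old Snapshot admin clamp: floor of 50, ceiling of 500."""
    return min(max(50, configured), 500)


def old_archiveresult_admin_page_size(configured: int) -> int:
    """Old ArchiveResult admin clamp: floor of 5, ceiling of 5000."""
    return min(max(5, configured), 5000)


def validate_page_size(configured: int) -> int:
    """Stand-in for the ge=1 validation: reject non-positive values, pass the rest through."""
    if configured < 1:
        raise ValueError("SNAPSHOTS_PER_PAGE must be >= 1")
    return configured


if __name__ == "__main__":
    print(old_snapshot_admin_page_size(40))       # 50: the configured 40 is silently raised
    print(old_archiveresult_admin_page_size(40))  # 40: within this clamp's range
    print(validate_page_size(40))                 # 40: used as-is after the change
```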
Sweep of all prose doc pages to fix references that were stale, wrong,
or pointed at anchors/options that no longer exist in 0.9.x.

Critical (non-functional examples + factual errors):
- All `PUBLIC_SNAPSHOTS=...` examples (Security-Overview, Publishing-
  Your-Archive, Usage) replaced with `PERMISSIONS=public|private`.
- Setting-up-Authentication: drop the "edit CSRF_TRUSTED_ORIGINS in
  archivebox/core/settings.py source" advice (no longer user-settable);
  update auth-permissions list to use PERMISSIONS instead of
  PUBLIC_SNAPSHOTS.
- Security-Overview: SAVE_ARCHIVE_DOT_ORG (with extra underscores)
  was never real; use ARCHIVEDOTORG_ENABLED.
- Docker/Install/Usage: FETCH_TITLE/FETCH_SCREENSHOT/FETCH_PDF/FETCH_DOM
  were never aliases (only FETCH_MEDIA is); replace with real
  <PLUGIN>_ENABLED.
- Troubleshooting: CHROME_BINARY default is `chromium`, not
  `chromium-browser`. Also fixed deprecated `brew cask upgrade
  chromium-browser` -> `brew upgrade --cask chromium`.
- Docker: typo MAX_MEDIA_SIZE -> MEDIA_MAX_SIZE.

Broken Configuration anchors (must be lowercase on GitHub wiki):
- Security-Overview: #FOOTER_INFO / #OUTPUT_PERMISSIONS / #COOKIES_FILE
  -> lowercase.
- Setting-up-Authentication: combined #public_index--public_snapshots--public_add_view
  -> individual #public_index / #public_add_view / #permissions.

Plugin option references now link to abx-plugins:
- CHROME_USER_DATA_DIR / CHROME_BINARY / CHROME_SANDBOX -> /#chrome
- RIPGREP_BINARY -> /#search_backend_ripgrep
- WGET_ENABLED / DOM_ENABLED / SAVE_WGET / SAVE_DOM -> respective anchors
- ARCHIVEDOTORG_ENABLED -> /#archivedotorg
- FAVICON_PROVIDER / FAVICON_ENABLED -> /#favicon
- MEDIA_ENABLED -> /#media

Legacy aliases:
- Scheduled-Archiving: URL_WHITELIST/URL_BLACKLIST -> URL_ALLOWLIST/
  URL_DENYLIST; dropped non-existent `--overwrite` schedule flag.

Dead source links removed:
- Usage: archivebox/main.py + archivebox/config.py (split to cli/ and
  config/common.py).
- Security-Overview: archivebox/extractors/*.py -> plugin anchors.
- Install: dead Configuration#dependency-options and
  Configuration#archive-method-toggles anchors -> abx-plugins reference.

Typo fixes (codespell):
- preferrably -> preferably, necesary -> necessary, Rasberry ->
  Raspberry, sytem -> system, Dissallow -> Disallow, whats -> what's,
  filesytem -> filesystem.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
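
The anchor fixes listed above amount to lowercasing the fragment of each link while leaving the rest of the URL alone. A minimal sketch of that transformation, using a hypothetical `lowercase_fragment` helper (the commit does not say how the links were actually rewritten):

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_fragment(link: str) -> str:
    """Lowercase only the '#fragment' part of a link; path and query are untouched.

    Hypothetical helper for illustration.
    """
    parts = urlsplit(link)
    return urlunsplit(parts._replace(fragment=parts.fragment.lower()))


if __name__ == "__main__":
    print(lowercase_fragment("Configuration#FOOTER_INFO"))  # Configuration#footer_info
```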
Clean rebuild of docs/apidocs/archivebox/ via autodoc2:
- Removes 19 stale module pages whose source files no longer exist
  (cli_utils, host_utils, schedule_utils — renamed to *_util; actors,
  ideas, debugging, folders, legacy, progress_layout, tests_piping,
  config_tags, personas.runtime/views, orchestrator*, worker, tasks).
- Adds 29 new module pages for code that was added since the previous
  generation but not yet documented.
- Updates 100 existing pages to reflect API surface changes (e.g.
  ENABLED_PLUGINS Field removed, SNAPSHOTS_PER_PAGE clamp removed,
  default bumped to 50, etc.).

156 source modules <-> 156 apidoc files (zero drift). Build clean
under sphinx-build -W --keep-going.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
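
A "zero drift" claim like the one above can be checked by comparing the set of source module names against the set of generated page names. The commit does not describe how the count was verified, so the following is only a sketch of one possible check; `find_drift` and the example names passed to it are assumptions.

```python
from typing import Iterable, List, Tuple


def find_drift(source_modules: Iterable[str], apidoc_pages: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (stale_pages, undocumented_modules) as sorted lists.

    stale_pages: pages with no matching source module.
    undocumented_modules: source modules with no matching page.
    """
    src, docs = set(source_modules), set(apidoc_pages)
    return sorted(docs - src), sorted(src - docs)


if __name__ == "__main__":
    # Illustrative names only: a renamed module leaves a stale page behind.
    stale, missing = find_drift(
        source_modules=["archivebox.cli.cli_util", "archivebox.hooks"],
        apidoc_pages=["archivebox.cli.cli_utils", "archivebox.hooks"],
    )
    print(stale)    # ['archivebox.cli.cli_utils']
    print(missing)  # ['archivebox.cli.cli_util']
```

Both lists being empty corresponds to the one-to-one match between modules and pages that the commit reports.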

Labels

⤵️ pull merge-conflict Resolve conflicts manually


3 participants