forked from tim-vandecasteele/docsplit
-
Notifications
You must be signed in to change notification settings - Fork 1
Replace shell-based process runner with open3_safe #9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
vandric
wants to merge
1
commit into
master
Choose a base branch
from
vandric-use-open3safe
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,2 +1,3 @@ | ||
| *.gem | ||
| .DS_Store | ||
| .idea |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,50 @@ | ||
| # CLAUDE.md | ||
|
|
||
| This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. | ||
|
|
||
| ## Commands | ||
|
|
||
| ```bash | ||
| # Install dependencies (open3_safe is fetched from GitHub at a pinned commit ref — see Gemfile) | ||
| bundle install | ||
|
|
||
| # Run all tests | ||
| bundle exec rake test | ||
|
|
||
| # Run a single test file | ||
| bundle exec ruby test/unit/test_extract_text.rb | ||
|
|
||
| # Build and install the gem locally | ||
| bundle exec rake gem:install | ||
| ``` | ||
|
|
||
| ## External Dependencies | ||
|
|
||
| The following system tools must be installed for full functionality: | ||
| - `gm` (GraphicsMagick) — image extraction and OCR pre-processing | ||
| - `pdftotext`, `pdfinfo`, `pdftk` — text extraction and PDF metadata | ||
| - `tesseract` — OCR (with optional `osd` language pack for orientation detection) | ||
| - `java` + JODConverter (vendored in `vendor/`) — non-PDF document conversion | ||
| - `open3_safe` gem — pinned to a specific GitHub commit ref in `Gemfile`; `Gemfile.lock` must be regenerated after changing the ref | ||
|
|
||
| ## Architecture | ||
|
|
||
| `lib/docsplit.rb` is the public API entry point. It defines the `Docsplit` module, checks `PATH` for dependencies at load time, and delegates to extractor classes. | ||
|
|
||
| **Extractor classes** (`lib/docsplit/`): | ||
| - `TextExtractor` — extracts text via `pdftotext`, falls back to Tesseract OCR for pages below `MIN_TEXT_PER_PAGE` (100 bytes) | ||
| - `ImageExtractor` — rasterizes PDF pages via GraphicsMagick (`gm convert`/`gm mogrify`) | ||
| - `PdfExtractor` — converts non-PDF documents to PDF using LibreOffice or JODConverter (Java) | ||
| - `InfoExtractor` — parses `pdfinfo` output for metadata | ||
| - `PageExtractor` — bursts PDFs into single-page PDFs via `pdftk`/`pdftailor` | ||
|
|
||
| **`ExternalProcess` module** (`external_process.rb`) is mixed into extractor classes. Its `run` method wraps `Open3Safe.capture3_safe` to execute subprocesses with: | ||
| - timeout (SIGTERM → SIGKILL after 5s) | ||
| - optional RSS memory limit via `max_rss:` | ||
| - stdout+stderr merged, blank lines and consecutive duplicate lines filtered (guards against memory bloat from corrupt PDFs — silverfin/issues/1998) | ||
|
|
||
| **Timeout-aware public API**: `extract_text_with_timeouts` and `extract_images_with_timeouts` accept `timeout` (overall) and `item_timeout` (per page/file); `extract_pdf_with_timeout` accepts only `timeout`. RSS caps are not part of the public API — each extractor hardcodes its own `MAX_RSS` constant (`TextExtractor`/`ImageExtractor`: 512 MiB, `TextExtractor::TESSERACT_MAX_RSS`: 1 GiB, `PdfExtractor`: 2 GiB) and passes it into `run(..., max_rss:)` internally. The plain `extract_*` variants have no timeouts. | ||
|
|
||
| ## Test Structure | ||
|
|
||
| Tests live in `test/unit/`, use Minitest, and write output to `test/output/` (cleaned up in `teardown`). Fixtures are in `test/fixtures/` — a mix of PDFs, Office docs, and edge-case files (encrypted, unicode, spaces/quotes in filenames). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| source 'https://rubygems.org' | ||
|
|
||
| gemspec | ||
|
|
||
| gem "open3_safe", github: "Silverfin-Engineering/open3_safe", ref: "5badbe14e94bd01d4420fc13ee9a1f129d78b182" | ||
| gem "minitest", "~> 6.0" |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,58 @@ | ||
| GIT | ||
| remote: https://github.com/Silverfin-Engineering/open3_safe.git | ||
| revision: 5badbe14e94bd01d4420fc13ee9a1f129d78b182 | ||
| ref: 5badbe14e94bd01d4420fc13ee9a1f129d78b182 | ||
| specs: | ||
| open3_safe (0.1.0) | ||
| get_process_mem | ||
|
|
||
| PATH | ||
| remote: . | ||
| specs: | ||
| docsplit (0.7.6) | ||
| open3_safe | ||
|
|
||
| GEM | ||
| remote: https://rubygems.org/ | ||
| specs: | ||
| bigdecimal (4.1.1) | ||
| drb (2.2.3) | ||
| ffi (1.17.4) | ||
| ffi (1.17.4-aarch64-linux-gnu) | ||
| ffi (1.17.4-aarch64-linux-musl) | ||
| ffi (1.17.4-arm-linux-gnu) | ||
| ffi (1.17.4-arm-linux-musl) | ||
| ffi (1.17.4-arm64-darwin) | ||
| ffi (1.17.4-x86-linux-gnu) | ||
| ffi (1.17.4-x86-linux-musl) | ||
| ffi (1.17.4-x86_64-darwin) | ||
| ffi (1.17.4-x86_64-linux-gnu) | ||
| ffi (1.17.4-x86_64-linux-musl) | ||
| get_process_mem (1.0.0) | ||
| bigdecimal (>= 2.0) | ||
| ffi (~> 1.0) | ||
| minitest (6.0.4) | ||
| drb (~> 2.0) | ||
| prism (~> 1.5) | ||
| prism (1.9.0) | ||
|
|
||
| PLATFORMS | ||
| aarch64-linux-gnu | ||
| aarch64-linux-musl | ||
| arm-linux-gnu | ||
| arm-linux-musl | ||
| arm64-darwin | ||
| ruby | ||
| x86-linux-gnu | ||
| x86-linux-musl | ||
| x86_64-darwin | ||
| x86_64-linux-gnu | ||
| x86_64-linux-musl | ||
|
|
||
| DEPENDENCIES | ||
| docsplit! | ||
| minitest (~> 6.0) | ||
| open3_safe! | ||
|
|
||
| BUNDLED WITH | ||
| 2.6.6 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,35 +1,46 @@ | ||
| require 'shellwords' | ||
| require 'open3_safe' | ||
|
|
||
| module Docsplit | ||
| module ExternalProcess | ||
| extend self | ||
|
|
||
| # Seconds to wait after SIGTERM before escalating to SIGKILL. | ||
| KILL_AFTER = 5 | ||
|
|
||
| # Default RSS byte cap applied when no max_rss is supplied (or nil is passed) — 512 MiB. | ||
| DEFAULT_MAX_RSS = 512 * 1024 * 1024 | ||
|
|
||
| # Run an external process and raise an exception if it fails. | ||
| def run(command, env = "", timeout = nil) | ||
| # If a corrupt PDF is parsed, it generates an infinite amount of identical warnings (with blank lines in between). | ||
| # By filtering these we avoid memory bloat when the executing process tries to capture stdout. The timeout makes | ||
| # sure we exit at some point. | ||
| # | ||
| # - See https://github.com/GetSilverfin/silverfin/issues/1998 | ||
| # - Add timeout so a stuck process doesn't block our Ruby process forever | ||
| # - Remove blank lines | ||
| # - Remove duplicate lines | ||
| run_command = "#{env} #{timeout_prefix(timeout)} #{command} | grep -v \"^$\" | uniq" | ||
|
|
||
| # - Run through bash so we can use PIPESTATUS | ||
| # - Use PIPESTATUS to return the exit status of #{command} instead of `uniq` | ||
| result = `bash -c '#{run_command}; exit ${PIPESTATUS[0]}'`.chomp | ||
| exit_code = $?.exitstatus | ||
|
|
||
| raise TimeoutError, run_command if exit_code == 137 | ||
| raise ExtractionFailed, result if exit_code != 0 | ||
|
|
||
| result | ||
| end | ||
| # | ||
| # command - shell command string (may include 2>&1, which is stripped) | ||
| # env - Hash of extra environment variables (default: {}) | ||
| # timeout - seconds before the process is sent SIGTERM (nil = no timeout) | ||
| # max_rss: - RSS byte threshold; defaults to DEFAULT_MAX_RSS | ||
| def run(command, env = {}, timeout = nil, max_rss: DEFAULT_MAX_RSS) | ||
| # Strip 2>&1 redirects — stderr is captured separately by Open3Safe. | ||
| cmd = Shellwords.split(command.gsub(/\s*2>&1\s*/, ' ').strip) | ||
|
|
||
| def timeout_prefix(timeout) | ||
| timeout ? "#{timeout_bin} #{timeout}" : "" | ||
| end | ||
| opts = { signal: :TERM, kill_after: KILL_AFTER, max_rss: max_rss } | ||
| opts[:timeout] = timeout if timeout | ||
|
|
||
| args = env.empty? ? [*cmd, opts] : [env, *cmd, opts] | ||
| result = Open3Safe.capture3_safe(*args) | ||
|
|
||
| # Combine stdout+stderr (previously unified via 2>&1 in callers), then filter blank | ||
| # lines and consecutive duplicates. This avoids memory bloat from corrupt PDFs that | ||
| # generate an infinite stream of identical warnings — see silverfin/issues/1998. | ||
| output = (result[:stdout] + result[:stderr]) | ||
| .lines | ||
| .reject { |l| l.chomp.empty? } | ||
| .each_with_object([]) { |l, acc| acc << l unless acc.last == l } | ||
| .join | ||
| .chomp | ||
|
|
||
| raise TimeoutError, command if result[:timeout] || result[:oom_killed] | ||
| raise ExtractionFailed, output if result[:status].exitstatus != 0 | ||
|
|
||
| def timeout_bin | ||
| # gtimeout on Mac | ||
| `which timeout` != "" ? "timeout --signal=KILL" : "gtimeout --signal=KILL" | ||
| output | ||
| end | ||
| end | ||
| end | ||
| end | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: If AI understands it correctly
result[:stderr]isn't a stream but a string already in memory containing the whole error output.If Ghostscript happens to vomit 500MB+ data to stderr, it will be in Ruby's heap at this point. That is what they wanted to fix in issue #1998.
I guess that's why they added
grep -v \"^$\" | uniqto the command so it won't reach Ruby.