Skip to content

Replace shell-based process runner with open3_safe#9

Open
vandric wants to merge 1 commit intomasterfrom
vandric-use-open3safe
Open

Replace shell-based process runner with open3_safe#9
vandric wants to merge 1 commit intomasterfrom
vandric-use-open3safe

Conversation

@vandric
Copy link
Copy Markdown

@vandric vandric commented Apr 20, 2026

Summary

  • Replaces ExternalProcess#run's bash-pipe + system timeout/gtimeout binary approach with Open3Safe.capture3_safe, which handles stdout/stderr capture, timeout, and RSS-based memory limiting without spawning an intermediate shell
  • Adds max_rss: keyword argument throughout the public API (extract_text_with_timeouts, extract_images_with_timeouts) and propagates it down through TextExtractor, ImageExtractor, and into each run call
  • Converts env parameter from a raw string (e.g. "MAGICK_TMPDIR=... OMP_NUM_THREADS=2") to a proper Hash as required by Open3
  • Removes 2>&1 redirects from all command strings — stderr is now captured separately and merged with stdout after the fact, with the same blank-line and consecutive-duplicate filtering as before (guards against memory bloat from corrupt PDF warning floods — see silverfin/issues/1998)
  • Adds open3_safe gem dependency (Gemfile, Gemfile.lock, docsplit.gemspec)

Test plan

  • Run existing text extraction against a known-good PDF and verify output is unchanged
  • Run image extraction and verify pages are rasterized correctly
  • Trigger a timeout scenario (e.g. with a very low timeout value) and confirm TimeoutError is raised
  • Trigger an OOM scenario (low max_rss: value) and confirm TimeoutError is raised
  • Feed a corrupt PDF that generates repetitive warnings and confirm no memory bloat
  • Verify pdffonts path in contains_text? works correctly end-to-end

🤖 Generated with Claude Code

@vandric vandric force-pushed the vandric-use-open3safe branch 4 times, most recently from c7e7eac to 0a81aca Compare April 27, 2026 07:07
@vandric vandric force-pushed the vandric-use-open3safe branch from 0a81aca to e5b63ee Compare May 6, 2026 07:21
Replaces the bash pipe + system timeout binary approach in ExternalProcess#run
with Open3Safe.capture3_safe, which provides proper stdout/stderr capture,
timeout handling, and RSS-based memory limiting without spawning a shell.

- Add open3_safe gem dependency
- Propagate max_rss: keyword arg through public API and all extractors
- Convert env strings to Hashes for Open3 compatibility
- Remove 2>&1 redirects (stderr now captured separately)
- Duplicate/blank-line filtering preserved in the new implementation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@vandric vandric force-pushed the vandric-use-open3safe branch from e5b63ee to 59c73e6 Compare May 6, 2026 09:10
# Combine stdout+stderr (previously unified via 2>&1 in callers), then filter blank
# lines and consecutive duplicates. This avoids memory bloat from corrupt PDFs that
# generate an infinite stream of identical warnings — see silverfin/issues/1998.
output = (result[:stdout] + result[:stderr])
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: If AI understands it correctly result[:stderr] isn't a stream but a string already in memory containing the whole error output.
If Ghostscript happens to vomit 500MB+ data to stderr, it will be in Ruby's heap at this point. That is what they wanted to fix in issue #1998.
I guess that's why they added grep -v \"^$\" | uniq to the command so it won't reach Ruby.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants