fix(harness): fix zero file sizes in sandbox ls and glob results#1509
fix(harness): fix zero file sizes in sandbox ls and glob results#1509wangdy wants to merge 1 commit into
Conversation
Both ls() and glob() in BaseSandboxFilesystem were constructing FileInfo entries with size 0, so sandbox-backed file listings always reported 0 bytes regardless of the actual file size. Fix ls() to emit tab-separated entries with real file sizes, and fix glob() to return "<size>\t<path>" records so FileInfo.size is populated correctly for matched files. Use GNU file utilities available in the Debian-based sandbox: stat -c%s for ls() and find -printf for glob(). Also add regression tests covering file size parsing for both paths.
|
|
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
AgentScopeJavaBot
left a comment
There was a problem hiding this comment.
🤖 AI Review
This PR fixes a real bug: ls() and glob() in BaseSandboxFilesystem were always reporting size=0 for files because the constructed shell commands did not emit size information. The fix switches to tab-separated output with stat -c%s (for ls) and find -printf '%s\t%p\n' (for glob), and adds focused unit tests that decouple parsing from sandbox execution. The fix is correct and the tests are well-structured. Two improvements are worth considering before merge: (1) the ls() implementation now forks one stat subprocess per file, which is O(N) forks per listing and noticeably slower than glob()'s single-pass find -printf; consider unifying ls() to use find -mindepth 1 -maxdepth 1 -printf '%y\t%p\t%s\n' for consistency and performance. (2) The ls() parser uses split("\t", 3), which is fragile if a filename contains a literal tab character — glob() uses split("\t", 2) and is robust to that. Otherwise the change is solid.
| String cmd = | ||
| "for f in " | ||
| + escapedPath | ||
| + "/*; do " | ||
| + " if [ -d \"$f\" ]; then echo \"DIR:$f\"; " | ||
| + " elif [ -f \"$f\" ]; then echo \"FILE:$f\"; fi; " | ||
| + "done 2>/dev/null"; | ||
| + "/*; do if [ -d \"$f\" ]; then printf 'DIR\\t%s\\n" | ||
| + "' \"$f\"; elif [ -f \"$f\" ]; then printf 'FILE\\t%s\\t%s\\n" | ||
| + "' \"$f\" \"$(stat -c%s \"$f\" 2>/dev/null || echo 0)\"; fi; done" | ||
| + " 2>/dev/null"; |
There was a problem hiding this comment.
[minor] Performance/consistency: this for f in .../*; do ... stat -c%s "$f" ...; done shells out to stat once per file, so listing a directory with N entries spawns N subprocesses. The glob() path below already uses a single find -printf invocation which is both faster and consistent. Consider rewriting ls() to use the same one-pass approach, e.g.
find <path> -mindepth 1 -maxdepth 1 -printf '%y\t%p\t%s\n' 2>/dev/nullwhere %y yields d/f/l, then map d->DIR and f->FILE in Java. This also removes the awkward two-line concatenation of the printf format string and aligns with the comment in glob() about single-pass O(1)-per-file metadata reads.
| entries.add(FileInfo.ofDir(line.substring(4), "")); | ||
| } else if (line.startsWith("FILE:")) { | ||
| entries.add(FileInfo.ofFile(line.substring(5), 0, "")); | ||
| String[] parts = line.split("\t", 3); |
There was a problem hiding this comment.
[minor] Robustness: line.split("\t", 3) is fragile if a filename contains a literal tab character. With a tab in the path, parts[1] will be the prefix of the path and parts[2] will be <rest-of-path>\t<size>, which fails Long.parseLong and silently yields size=0 (the very bug this PR is fixing). The glob() parser uses split("\t", 2) which is tab-safe because the path is the trailing field. If you adopt the find -printf suggestion above and put %s before %p ('%y\t%s\t%p\n'), you can keep split("\t", 3) safely with the path as the trailing field.
| String[] parts = line.split("\t", 2); | ||
| if (parts.length == 2) { | ||
| // find -printf '%s\t%p\n': parts[0]=size, parts[1]=path | ||
| entries.add(FileInfo.ofFile(parts[1].trim(), parseFileSize(parts[0]), "")); |
There was a problem hiding this comment.
[nit] parts[1].trim() strips legitimate leading/trailing whitespace from filenames (whitespace is valid in POSIX paths). Since the field is delimited by tab and find -printf '%p' does not pad, the trim is unnecessary and can corrupt paths. Also note the inconsistency: ls() does not trim its path field. Suggest dropping .trim() here.
| private static long parseFileSize(String raw) { | ||
| try { | ||
| return Long.parseLong(raw.trim()); | ||
| } catch (NumberFormatException e) { | ||
| return 0L; |
There was a problem hiding this comment.
[nit] parseFileSize silently swallows NumberFormatException and returns 0, which is the same value used to indicate "size unknown". If the sandbox ever returns an unexpected format, callers will see size=0 with no signal. Consider logging at WARN/DEBUG so this failure mode is diagnosable in production. Non-blocking.
| @Test | ||
| void ls_parsesFileSizesFromSandboxOutput() { | ||
| TestSandboxFilesystem filesystem = new TestSandboxFilesystem(); | ||
| filesystem.nextExecuteResponse = | ||
| new ExecuteResponse( | ||
| "DIR\t/workspace/session/uploads/subdir\n" | ||
| + "FILE\t/workspace/session/uploads/a.pdf\t1234\n", | ||
| 0, | ||
| false); | ||
|
|
||
| var result = filesystem.ls(RuntimeContext.empty(), "/workspace/session/uploads"); | ||
|
|
||
| assertFalse(result.entries().isEmpty()); | ||
| assertEquals(2, result.entries().size()); | ||
| assertEquals("/workspace/session/uploads/subdir", result.entries().get(0).path()); | ||
| assertEquals("/workspace/session/uploads/a.pdf", result.entries().get(1).path()); | ||
| assertEquals(1234L, result.entries().get(1).size()); | ||
| } | ||
|
|
||
| @Test | ||
| void glob_parsesFileSizesFromSandboxOutput() { | ||
| TestSandboxFilesystem filesystem = new TestSandboxFilesystem(); | ||
| filesystem.nextExecuteResponse = | ||
| new ExecuteResponse( | ||
| "128\t/workspace/session/uploads/a.pdf\n" | ||
| + "2048\t/workspace/session/uploads/b.xlsx\n", | ||
| 0, | ||
| false); | ||
|
|
||
| var result = filesystem.glob(RuntimeContext.empty(), "*.pdf", "/workspace/session/uploads"); | ||
|
|
||
| assertEquals(2, result.matches().size()); | ||
| assertEquals("/workspace/session/uploads/a.pdf", result.matches().get(0).path()); | ||
| assertEquals(128L, result.matches().get(0).size()); | ||
| assertEquals("/workspace/session/uploads/b.xlsx", result.matches().get(1).path()); | ||
| assertEquals(2048L, result.matches().get(1).size()); | ||
| } |
There was a problem hiding this comment.
[praise] Nice test design: subclassing BaseSandboxFilesystem with a stub execute cleanly isolates the parsing logic from real sandbox I/O. This pattern keeps the regression coverage fast and deterministic and is worth reusing for future sandbox-shell parsing changes.
Problem
Both
ls()andglob()inBaseSandboxFilesystemwere constructingFileInfoentries with size0, so sandbox-backed file listings alwaysreported
0 bytesregardless of the actual file size.Fix
ls()to emit tab-separated entries with real file sizesglob()to return"<size>\t<path>"records soFileInfo.sizeis populated correctly
stat -c%sforls()andfind -printfforglob()Testing
mvn -pl agentscope-harness -Dtest=BaseSandboxFilesystemTest test