fix: Stop node execution from deleting or moving pre-existing datasets in workspace by floo-dck · Pull Request #41 · ISG-Siegen/AutoRecLab

floo-dck · 2026-05-26T12:18:37Z

Summary

This PR introduces a workspace snapshotting mechanism in exec_node right before the interpreter runs. By capturing the existing file state, we ensure that only newly generated artifacts are moved to the node's checkpoint directory, protecting static assets like datasets from being deleted.

Problem

Previously, the code gathered all files present in the workspace directory after execution to clean them up or move them to checkpoints.

If a large dataset (like movielens-100k) was placed inside the workspace for the recommender experiments to use, the tree search runner classified it as a "generated file". Because dataset files do not match the image whitelist (.png, .jpg, etc.), they were automatically wiped out by shutil.rmtree or os.remove when keep_only_relevant_files was active. This completely broke subsequent experiment iterations that relied on the data.
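The deletion path described above can be reproduced in isolation. The following is a minimal sketch, not the actual `exec_node` code: `cleanup_workspace`, `demo_cleanup`, and the two-entry `IMAGE_SUFFIXES` whitelist are hypothetical stand-ins for the behavior when keep_only_relevant_files is active.

```python
import shutil
import tempfile
from pathlib import Path

# Assumed whitelist; the real one may contain more suffixes.
IMAGE_SUFFIXES = {".png", ".jpg"}


def cleanup_workspace(workspace_dir: Path) -> list[str]:
    """Delete every non-whitelisted entry and return the removed names."""
    removed = []
    for item in sorted(workspace_dir.iterdir()):
        if item.is_file() and item.suffix.lower() in IMAGE_SUFFIXES:
            continue  # whitelisted image, keep it
        if item.is_dir():
            shutil.rmtree(item)
        else:
            item.unlink()
        removed.append(item.name)
    return removed


def demo_cleanup() -> tuple[list[str], list[str]]:
    """Place a dataset and a plot in a temp workspace, then clean it."""
    with tempfile.TemporaryDirectory() as tmp:
        ws = Path(tmp)
        (ws / "movielens.csv").write_text("user,item,rating\n")  # pre-placed input
        (ws / "plot.png").write_bytes(b"")  # stands in for a generated plot
        removed = cleanup_workspace(ws)
        remaining = sorted(p.name for p in ws.iterdir())
    return removed, remaining
```

Because the cleaner only looks at the suffix, it has no way to tell a pre-placed input from a generated artifact, so the CSV is removed along with everything else that is not an image.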

Solution

Added a snapshot, files_before_exec = set(workspace_dir.rglob("*")), taken immediately before self._interpreter.run(current_code).
Filtered the file collection loops for both the workspace root and the working/ directory so that they only process items where item not in files_before_exec.
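The two changes above can be sketched as follows. This is an illustration under assumptions, not the diff itself: `collect_new_files` and `demo_snapshot` are hypothetical names, the checkpoint layout is simplified to a flat directory, and two file writes stand in for `self._interpreter.run(current_code)`.

```python
import shutil
import tempfile
from pathlib import Path


def collect_new_files(workspace_dir: Path, files_before_exec: set[Path]) -> list[Path]:
    """Return files that appeared in the workspace after the snapshot."""
    return sorted(
        p for p in workspace_dir.rglob("*")
        if p.is_file() and p not in files_before_exec
    )


def demo_snapshot() -> tuple[list[str], bool]:
    with tempfile.TemporaryDirectory() as tmp:
        ws = Path(tmp) / "workspace"
        ckpt = Path(tmp) / "checkpoint"  # simplified checkpoint directory
        (ws / "working").mkdir(parents=True)
        ckpt.mkdir()
        (ws / "movielens.csv").write_text("user,item,rating\n")  # pre-placed dataset

        # Snapshot immediately before execution.
        files_before_exec = set(ws.rglob("*"))

        # Stand-in for the interpreter run: the node writes two artifacts.
        (ws / "working" / "metrics.json").write_text("{}")
        (ws / "plot.png").write_bytes(b"")

        # Only files absent from the snapshot are moved to the checkpoint.
        moved = []
        for p in collect_new_files(ws, files_before_exec):
            shutil.move(str(p), str(ckpt / p.name))
            moved.append(p.name)

        dataset_kept = (ws / "movielens.csv").exists()
    return sorted(moved), dataset_kept
```

One limitation of a purely path-based snapshot is that a pre-existing file which the executed code overwrites in place still counts as pre-existing and is left in the workspace.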

…rkspace Snapshots the workspace file state before running code in `exec_node`. This ensures that pre-existing assets like datasets (e.g., MovieLens) are not mistakenly identified as newly generated artifacts, which previously caused them to be moved or permanently deleted by the file whitelist cleaner.

eisenbahnhero

I’m not entirely sure whether this error can actually occur. In my opinion, Omnirec should simply reload the datasets. After all, each node is effectively a self-contained process.

Do you maybe have an AutoRecLab run where this error occurred?

We’ve already carried out a few experiments ourselves with keep_only_relevant_files enabled and haven’t actually encountered this problem so far. But that doesn’t mean, of course, that it can’t happen.

floo-dck · 2026-05-27T09:20:02Z

Thanks for the feedback!

To show exactly how this failure loop is triggered, I have attached my log file and refer to the screenshots from my environment below.

To clarify: in my setup, keep_only_relevant_files is set to False. The issue occurs because I manually place the movielens.csv dataset directly into the out/workspace folder so that the experiments can use it locally from the start (as seen in the screenshot).

Here is the step-by-step breakdown of the failure loop documented in the logs:

Iteration 1 starts: The pipeline runs smoothly at first because the dataset is physically present in the workspace.
Iteration 1 finishes: As seen in the log line [2026/05/27 10:56:29] [INFO] isgsa.treesearch: Moved movielens.csv to checkpoint, exec_node sweeps up the entire workspace. Since keep_only_relevant_files is False, the code hits the fallback block and uses shutil.move() to relocate movielens.csv.
The Trap: Because shutil.move relocates the file rather than copying it, the workspace no longer contains the dataset. It now sits inside the checkpoint folder.
Iteration 2 starts: The next node tries to execute. As shown in the log line [2026/05/27 10:57:48] [DEBUG] isgsa.treesearch: ExecutionResult(...) FileNotFoundError: Could not find C:\Users\flori\PycharmProjects\AutoRecLab-group4\out\workspace\movielens.csv, it looks for the dataset file in the workspace and immediately crashes with a FileNotFoundError because the file was moved away during the previous node's cleanup.
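The loop above can be reproduced without AutoRecLab. The sketch below is a simplified stand-in for the fallback branch (keep_only_relevant_files set to False); `demo_failure_loop` and the flat checkpoint directory are hypothetical, and the real cleanup code may differ in detail.

```python
import shutil
import tempfile
from pathlib import Path


def demo_failure_loop() -> tuple[bool, bool]:
    """Return (second_read_ok, dataset_in_checkpoint)."""
    with tempfile.TemporaryDirectory() as tmp:
        ws = Path(tmp) / "workspace"
        ckpt = Path(tmp) / "checkpoint"
        ws.mkdir()
        ckpt.mkdir()
        dataset = ws / "movielens.csv"
        dataset.write_text("user,item,rating\n")

        # Iteration 1: the dataset is present, so reading it works.
        dataset.read_text()

        # Cleanup without a snapshot: everything in the workspace is moved.
        for item in list(ws.iterdir()):
            shutil.move(str(item), str(ckpt / item.name))

        # Iteration 2: the same read now fails.
        try:
            dataset.read_text()
            second_read_ok = True
        except FileNotFoundError:
            second_read_ok = False

        in_checkpoint = (ckpt / "movielens.csv").exists()
    return second_read_ok, in_checkpoint
```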

Semantic Issue

Furthermore, moving the dataset into checkpoint/<node_id>/generated/ is conceptually incorrect. movielens.csv is a static, pre-placed input asset—it was never "generated" by the LLM or the execution process. Archiving it there pollutes the checkpoint data and makes telemetry highly misleading.

My Log: debug.log

michi774 requested review from eisenbahnhero and michi774 May 26, 2026 14:24

eisenbahnhero reviewed May 27, 2026

michi774 force-pushed the develop branch from 706cd93 to e990fa5 Compare May 27, 2026 14:49

floo-dck wants to merge 1 commit into ISG-Siegen:develop from leonlenz:fix/workspace-csv-clear
