fix: Stop node execution from deleting or moving pre-existing datasets in workspace #41

Open
floo-dck wants to merge 1 commit into ISG-Siegen:develop from leonlenz:fix/workspace-csv-clear

@floo-dck

Summary

This PR introduces a workspace snapshotting mechanism in exec_node right before the interpreter runs. By capturing the existing file state, we ensure that only newly generated artifacts are moved to the node's checkpoint directory, protecting static assets like datasets from being moved or deleted.

Problem

Previously, the code gathered all files present in the workspace directory after execution to clean them up or move them to checkpoints.

If a large dataset (like movielens-100k) was placed inside the workspace for the recommender experiments to use, the tree search runner classified it as a "generated file". Because dataset files do not match the image whitelist (.png, .jpg, etc.), they were automatically wiped out by shutil.rmtree or os.remove when keep_only_relevant_files was active. This completely broke subsequent experiment iterations that relied on the data.

Solution

  • Implemented files_before_exec = set(workspace_dir.rglob("*")) immediately prior to self._interpreter.run(current_code).
  • Filtered the file collection loops for both the workspace root and the working/ directory to strictly process items where item not in files_before_exec.
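
A minimal sketch of the snapshot-and-filter idea described in the two bullets above, not the actual exec_node code. The helper names `snapshot`, `move_new_files`, and `demo` are hypothetical, the interpreter run is replaced by a plain file write, and only top-level workspace items are handled:

```python
import shutil
import tempfile
from pathlib import Path


def snapshot(workspace_dir: Path) -> set[Path]:
    """Return every path that exists under the workspace right now."""
    return set(workspace_dir.rglob("*"))


def move_new_files(workspace_dir: Path, checkpoint_dir: Path,
                   before: set[Path]) -> list[str]:
    """Move only top-level items that were absent from the snapshot."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for item in sorted(workspace_dir.iterdir()):
        if item in before:
            continue  # pre-existing asset (e.g. a dataset): leave it alone
        shutil.move(str(item), str(checkpoint_dir / item.name))
        moved.append(item.name)
    return moved


def demo() -> tuple[list[str], bool]:
    """Hypothetical walk-through in a throwaway temp directory."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp) / "workspace"
        checkpoint = Path(tmp) / "checkpoint" / "generated"
        workspace.mkdir()
        dataset = workspace / "movielens.csv"
        dataset.write_text("user,item,rating\n")

        before = snapshot(workspace)
        # Stand-in for self._interpreter.run(current_code)
        (workspace / "plot.png").write_bytes(b"")

        moved = move_new_files(workspace, checkpoint, before)
        return moved, dataset.exists()
```

In this sketch, `demo()` moves only `plot.png` to the checkpoint directory and leaves `movielens.csv` in place. A set of `Path` objects works as the snapshot because paths compare by value, so an unchanged file produces an equal `Path` before and after the run.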

…rkspace

Snapshots the workspace file state before running code in `exec_node`.
This ensures that pre-existing assets like datasets (e.g., MovieLens)
are not mistakenly identified as newly generated artifacts, which
previously caused them to be moved or permanently deleted by the
file whitelist cleaner.
Collaborator

@eisenbahnhero eisenbahnhero left a comment

I’m not entirely sure whether this error can actually occur. In my opinion, Omnirec should simply reload the datasets. After all, each node is effectively a self-contained process.

Do you maybe have an AutoRecLab run where this error occurred?

We’ve already carried out a few experiments ourselves with keep_only_relevant_files enabled and haven’t actually encountered this problem so far. But that doesn’t mean, of course, that it can’t happen.

@floo-dck
Author

Thanks for the feedback!

To show exactly how this failure loop is triggered, I have attached my log file and the screenshots from my environment below.

To clarify: in my setup, keep_only_relevant_files is set to False. The issue occurs because I manually place the movielens.csv dataset directly into the out/workspace folder so that the experiments can use it locally from the start (as seen in the screenshot).

Here is the step-by-step breakdown of the failure loop documented in the logs:

  1. Iteration 1 starts: The pipeline runs smoothly at first because the dataset is physically present in the workspace.
  2. Iteration 1 finishes: exec_node sweeps up the entire workspace, as seen in the log line [2026/05/27 10:56:29] [INFO] isgsa.treesearch: Moved movielens.csv to checkpoint. Since keep_only_relevant_files is False, the code hits the fallback block and uses shutil.move() to relocate movielens.csv.
  3. The Trap: Because shutil.move relocates the file instead of copying it, the workspace no longer contains the dataset. The file is now archived inside the checkpoint folder.
  4. Iteration 2 starts: The next node looks for the dataset in the workspace and immediately crashes because the file was moved away during the previous node's cleanup, as shown in the log line [2026/05/27 10:57:48] [DEBUG] isgsa.treesearch: ExecutionResult(...) FileNotFoundError: Could not find C:\Users\flori\PycharmProjects\AutoRecLab-group4\out\workspace\movielens.csv.
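
The four steps above can be reproduced in isolation with a small sketch. It assumes an unfiltered cleanup that moves every workspace item. The names `run_node`, `sweep_all`, and `reproduce` are hypothetical stand-ins and are not taken from the repository:

```python
import shutil
import tempfile
from pathlib import Path


def run_node(workspace: Path) -> int:
    """Stand-in for a node's generated code: read the dataset."""
    return len((workspace / "movielens.csv").read_text().splitlines())


def sweep_all(workspace: Path, checkpoint: Path) -> None:
    """Unfiltered cleanup: move everything out of the workspace."""
    checkpoint.mkdir(parents=True, exist_ok=True)
    for item in list(workspace.iterdir()):
        shutil.move(str(item), str(checkpoint / item.name))


def reproduce() -> list[str]:
    """Run two iterations and record what happens in each."""
    events = []
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp) / "workspace"
        workspace.mkdir()
        # Dataset placed in the workspace by hand before the first node runs
        (workspace / "movielens.csv").write_text("user,item,rating\n1,1,5\n")

        for iteration in (1, 2):
            try:
                run_node(workspace)
                events.append(f"iteration {iteration}: ok")
            except FileNotFoundError:
                events.append(f"iteration {iteration}: FileNotFoundError")
            sweep_all(workspace, Path(tmp) / "checkpoint" / f"node{iteration}")
    return events
```

Under these assumptions, the first iteration succeeds and the second fails with a FileNotFoundError, because the cleanup after the first iteration has already moved the dataset into the first node's checkpoint folder.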

Semantic Issue

Furthermore, moving the dataset into checkpoint/<node_id>/generated/ is conceptually incorrect. movielens.csv is a static, pre-placed input asset—it was never "generated" by the LLM or the execution process. Archiving it there pollutes the checkpoint data and makes telemetry highly misleading.

My Log: debug.log

Screenshots: 2026-05-27 110408, 2026-05-27 105825, 2026-05-27 105839


2 participants