
feat: inject workspace file context into LLM prompts for data-aware code generation #53

Open
TristanSchneider-dev wants to merge 1 commit into ISG-Siegen:develop from leonlenz:feat/workspace_context


@TristanSchneider-dev

🚀 Description

This PR introduces a new utility module utils/workspace_context.py that dynamically scans the local execution workspace (out/workspace) for user-provided data files (such as .csv, .tsv, .parquet, .json, .jsonl). It automatically formats and injects their metadata directly into all LLM interaction prompts.

By giving the model immediate visibility into exactly what custom files are available and how to reference them, this enhancement prevents the planner from guessing paths or hallucinating filenames.

🛡️ Edge Cases Handled

  • Empty/Missing Workspace: Features a graceful fallback that prints (No custom data files found.) or Not yet created instead of crashing.
  • Noise Filtering: Internal pipeline execution or log files (e.g., save.pkl, runfile.py, out.log) are automatically filtered out and ignored.
  • Environment Agnostic: Paths are resolved robustly across different environments, whether running inside Docker, via local relative execution paths, or using environment variables (/app/..., ./out/..., $ARL_out_dir/...).
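The edge cases above could be handled along the following lines. This is a minimal sketch, not the code in utils/workspace_context.py: the function names `resolve_workspace` and `scan_workspace`, the lookup order for `ARL_out_dir`, and the exact ignore list are assumptions made for illustration. Only the extensions and the three ignored file names are taken from the description.

```python
import os
from pathlib import Path

# Extensions named in the PR description.
DATA_EXTENSIONS = {".csv", ".tsv", ".parquet", ".json", ".jsonl"}
# Internal pipeline files named in the PR description (the real list may be longer).
IGNORED_NAMES = {"save.pkl", "runfile.py", "out.log"}


def resolve_workspace(project_root: Path) -> Path:
    """Hypothetical resolution order: $ARL_out_dir if set, else <root>/out."""
    out_dir = os.environ.get("ARL_out_dir")
    base = Path(out_dir) if out_dir else project_root / "out"
    return base / "workspace"


def scan_workspace(workspace: Path) -> list[Path]:
    """Return user data files; an empty list if the directory does not exist."""
    if not workspace.is_dir():
        return []
    return sorted(
        p
        for p in workspace.iterdir()
        if p.is_file()
        and p.suffix.lower() in DATA_EXTENSIONS
        and p.name not in IGNORED_NAMES
    )
```

Returning an empty list for a missing directory lets the caller decide which fallback text to print, so the scan itself never raises on a fresh checkout.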

📁 Files Changed

  • utils/workspace_context.py — Added path resolution, directory scanning, filtering logic, and context formatting.
  • treesearch/minimal_agent.py — Updated LLM prompt injection hooks to include the workspace context.

🧪 Proof of Functionality

1. User Prompt vs. Injected Context

When a user requests a custom dataset, the agent scans the workspace, detects the file, and embeds its metadata directly into the system prompt context:

Prompt: [screenshot]

Workspace with custom .csv: [screenshot]

2. Shared Options & Path Detection

The LLM is presented with both options, downloading the data or using the provided CSV:

MinimalAgent: Getting plan and code
                    INFO     === Workspace Context injected into LLM prompt ===                                                                     minimal_agent.py:326
                    INFO     WORKSPACE_CTX: ## Workspace & File Context                                                                             minimal_agent.py:328
                    INFO     WORKSPACE_CTX:                                                                                                         minimal_agent.py:328
                    INFO     WORKSPACE_CTX: - **Project root:** `/home/pete-fed/PycharmProjects/AutoRecLab-group4`                                  minimal_agent.py:328
                    INFO     WORKSPACE_CTX: - **Code execution workspace:** `/home/pete-fed/PycharmProjects/AutoRecLab-group4/out/workspace`        minimal_agent.py:328
                    INFO     WORKSPACE_CTX:                                                                                                         minimal_agent.py:328
                    INFO     WORKSPACE_CTX: ### How to access data files                                                                            minimal_agent.py:328
                    INFO     WORKSPACE_CTX:                                                                                                         minimal_agent.py:328
                    INFO     WORKSPACE_CTX: When your code runs, the working directory is the code execution workspace listed above.                minimal_agent.py:328
                    INFO     WORKSPACE_CTX:                                                                                                         minimal_agent.py:328
                    INFO     WORKSPACE_CTX: You have two options to load datasets:                                                                  minimal_agent.py:328
                    INFO     WORKSPACE_CTX:                                                                                                         minimal_agent.py:328
                    INFO     WORKSPACE_CTX: 1. **Use the `dataloader` package** (recommended for standard datasets):                                minimal_agent.py:328
                    INFO     WORKSPACE_CTX:    ```python                                                                                            minimal_agent.py:328
                    INFO     WORKSPACE_CTX:    from dataloader.loaders.registry import _run_loader                                                  minimal_agent.py:328
                    INFO     WORKSPACE_CTX:    df = _run_loader('MovieLens100K')  # Downloads & caches automatically                                minimal_agent.py:328
                    INFO     WORKSPACE_CTX:    ```                                                                                                  minimal_agent.py:328
                    INFO     WORKSPACE_CTX:    Available datasets include: MovieLens100K, MovieLens1M, MovieLens10M, MovieLens20M,                  minimal_agent.py:328
                    INFO     WORKSPACE_CTX:    MovieLens25M, MovieLensLatest, MovieLensLatestSmall, MovieLens1BSynthetic,                           minimal_agent.py:328
                    INFO     WORKSPACE_CTX:    Amazon2014*, Amazon2018*, Amazon2023*, Yelp2023, Gowalla, BeerAdvocate, etc.                         minimal_agent.py:328
                    INFO     WORKSPACE_CTX:                                                                                                         minimal_agent.py:328
                    INFO     WORKSPACE_CTX: 2. **Use custom files placed in the workspace** (listed below):                                         minimal_agent.py:328
                    INFO     WORKSPACE_CTX: ## Custom data files in workspace                                                                       minimal_agent.py:328
                    INFO     WORKSPACE_CTX:                                                                                                         minimal_agent.py:328
                    INFO     WORKSPACE_CTX: - `groesste_studenten_uni_siegen.csv` — CSV data file (932.0 B)                                         minimal_agent.py:328
                    INFO     WORKSPACE_CTX:                                                                                                         minimal_agent.py:328
                    INFO     WORKSPACE_CTX: To use a custom file in your code, reference it by its filename                                         minimal_agent.py:328
                    INFO     WORKSPACE_CTX: (the code runs inside the workspace directory):                                                         minimal_agent.py:328
                    INFO     WORKSPACE_CTX: ```python                                                                                               minimal_agent.py:328
                    INFO     WORKSPACE_CTX: df = pd.read_csv('filename.csv')                                                                        minimal_agent.py:328
                    INFO     WORKSPACE_CTX: ```                                                                                                     minimal_agent.py:328
                    INFO     === End workspace context ===                                                                                          minimal_agent.py:329

The log output was a quick-and-dirty debugging aid and is not committed.
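The file listing in the log above (name, type label, human-readable size) could be produced by something like the sketch below. The helper names `human_size` and `format_file_listing` and the label table are hypothetical; only the CSV label, the section heading, the `932.0 B` size format, and the fallback text appear in this PR.

```python
from pathlib import Path

# Only the CSV label is visible in the log; the others are assumed for illustration.
TYPE_LABELS = {
    ".csv": "CSV data file",
    ".tsv": "TSV data file",
    ".json": "JSON data file",
}
SEPARATOR = "\u2014"  # the dash character used in the logged listing


def human_size(num_bytes: int) -> str:
    """Format a byte count with one decimal place, e.g. 932 -> '932.0 B'."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} GB"


def format_file_listing(files: list[tuple[str, int]]) -> str:
    """Render (filename, size in bytes) pairs as a markdown section."""
    lines = ["## Custom data files in workspace", ""]
    if not files:
        lines.append("(No custom data files found.)")
    for name, size in files:
        label = TYPE_LABELS.get(Path(name).suffix.lower(), "data file")
        lines.append(f"- `{name}` {SEPARATOR} {label} ({human_size(size)})")
    return "\n".join(lines)


print(format_file_listing([("groesste_studenten_uni_siegen.csv", 932)]))
```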

3. Execution Result (Successfully Created Row List)

Equipped with accurate file metadata, the agent compiles a valid runfile.py script, executing it successfully to yield the exact top 5 rows from the target CSV:

Matrikelnummer;Vorname;Nachname;Geschlecht;Alter;Studiengang;Fachsemester;Koerpergroesse_cm
1575400;Tim;Wagner;M;27;Wirtschaftsingenieurwesen (B.Sc.);18;209
1351722;Lukas;Becker;M;20;Informatik (M.Sc.);3;207
1591650;Julian;Schulz;M;28;Lehramt an Grundschulen;20;207
1341530;Noah;Bauer;M;27;Physik (B.Sc.);17;205
1216663;Ben;Richter;M;19;Wirtschaftsingenieurwesen (B.Sc.);1;202
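The output above is semicolon-delimited, so whatever the generated runfile.py does (it is not shown in this PR), it has to read the file with a `;` separator rather than the comma default. The sketch below illustrates this with the standard-library `csv` module on two rows copied from the output; the function name `first_rows` is hypothetical. With pandas, the equivalent would be passing `sep=';'` to `pd.read_csv`.

```python
import csv
import io
from itertools import islice

# Header and first two rows copied from the output above.
# A real script would open the file in the workspace instead.
SAMPLE = """Matrikelnummer;Vorname;Nachname;Geschlecht;Alter;Studiengang;Fachsemester;Koerpergroesse_cm
1575400;Tim;Wagner;M;27;Wirtschaftsingenieurwesen (B.Sc.);18;209
1351722;Lukas;Becker;M;20;Informatik (M.Sc.);3;207
"""


def first_rows(text: str, n: int) -> list[dict[str, str]]:
    """Parse semicolon-delimited text and return the first n rows as dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return list(islice(reader, n))


for row in first_rows(SAMPLE, 2):
    print(row["Vorname"], row["Nachname"], row["Koerpergroesse_cm"])
```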

The change worked like a charm: in three test cases, it found the right file immediately without guessing the path :)

@TristanSchneider-dev
Author

Edit: If desired, a dedicated, context-aware folder for loading data could be created and used instead of treating the workspace as dataset storage.

Collaborator

@eisenbahnhero left a comment


Looks good from my side. Thx
