This repository contains the course project implementation for a long-term memory dialogue agent. It includes baseline agents, an append-only memory agent, an update/merge memory agent, and an update + reflection agent.
Create an isolated conda environment:
mamba env create -f environment.yml
conda activate memory-agent
If the environment already exists, update it with:
mamba env update -f environment.yml --prune
conda activate memory-agent
Start with a small sample before running the full evaluation:
cd eval_kit
python prepare_eval_set.py --output eval_set_small.json --per_category 2 --seed 42
cd ..
The no-memory baseline answers only from the question itself and is intended as the required control group.
First configure an OpenAI-compatible chat endpoint. For a local vLLM server, the
defaults in eval_kit/llm_client.py are enough:
export LLM_BASE_URL="http://localhost:8000/v1"
export LLM_API_KEY="EMPTY"
export LLM_MODEL="Qwen/Qwen2.5-3B-Instruct-AWQ"
On macOS, the simplest local setup is Ollama:
brew services start ollama
ollama pull qwen2.5:3b
source env.local-ollama.sh
Then run generation:
python eval_kit/run_generation.py \
--eval_set eval_kit/eval_set_small.json \
--agent memory_agent.agent.baselines:NoMemoryAgent \
--output experiments/results/predictions_nomem_small.json
If you do not want to activate the environment in the current shell, use:
conda run -n memory-agent python eval_kit/run_generation.py \
--eval_set eval_kit/eval_set_small.json \
--agent memory_agent.agent.baselines:NoMemoryAgent \
--output experiments/results/predictions_nomem_small.json
For cloud APIs, set LLM_BASE_URL, LLM_API_KEY, and LLM_MODEL to the
provider's OpenAI-compatible endpoint before running the same command.
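As a rough illustration of how these three variables could be resolved in Python, the sketch below reads them from the environment and falls back to the local vLLM values shown earlier. The names `LLMSettings` and `load_llm_settings` are hypothetical and are not taken from eval_kit/llm_client.py, whose actual defaults and structure may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LLMSettings:
    """Endpoint settings for an OpenAI-compatible chat server (illustrative)."""
    base_url: str
    api_key: str
    model: str


def load_llm_settings(env: Optional[Mapping[str, str]] = None) -> LLMSettings:
    """Resolve settings from environment variables (hypothetical helper).

    The fallback values mirror the local vLLM example in this README;
    the real defaults live in eval_kit/llm_client.py and may differ.
    """
    env = os.environ if env is None else env
    return LLMSettings(
        # Strip a trailing slash so paths can be joined consistently.
        base_url=env.get("LLM_BASE_URL", "http://localhost:8000/v1").rstrip("/"),
        api_key=env.get("LLM_API_KEY", "EMPTY"),
        model=env.get("LLM_MODEL", "Qwen/Qwen2.5-3B-Instruct-AWQ"),
    )


if __name__ == "__main__":
    settings = load_llm_settings()
    # Print only non-secret fields.
    print(settings.base_url, settings.model)
```

Passing a plain dict instead of relying on `os.environ` makes it easy to check a cloud configuration without exporting anything in the shell.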
The final experimental agent is:
memory_agent.agent.controller:UpdateReflectionAgent
It uses extracted memories, high-signal detail notes, rule-based memory deduplication/merge, and reflection memories.
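The `--agent` argument uses a `package.module:ClassName` string. A minimal sketch of how such a string can be resolved with the standard library is shown below; `load_agent_class` is a hypothetical helper, and the loader actually used by eval_kit/run_generation.py may work differently.

```python
import importlib


def load_agent_class(spec: str):
    """Resolve a 'package.module:ClassName' string to the named attribute.

    Hypothetical helper for illustration; not taken from the repository.
    """
    module_name, sep, class_name = spec.partition(":")
    if not sep or not module_name or not class_name:
        raise ValueError(f"expected 'module:ClassName', got {spec!r}")
    # import_module raises ModuleNotFoundError if the module is missing.
    module = importlib.import_module(module_name)
    try:
        return getattr(module, class_name)
    except AttributeError as exc:
        raise ValueError(f"{module_name} has no attribute {class_name!r}") from exc


if __name__ == "__main__":
    # Demonstrated with a standard-library class so the snippet runs anywhere.
    cls = load_agent_class("collections:OrderedDict")
    print(cls.__name__)
```

With the repository on the Python path, the same call with `memory_agent.agent.controller:UpdateReflectionAgent` would be expected to return the agent class.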
For a quick small-set run:
MEMORY_PER_SESSION=4 \
MEMORY_TOP_K=10 \
MEMORY_DETAIL_NOTES=1 \
MEMORY_DETAIL_NOTES_PER_SESSION=3 \
MEMORY_UPDATE_USE_LLM=0 \
MEMORY_REFLECTIONS=6 \
MEMORY_REFLECTION_INPUTS=120 \
MEMORY_ALLOW_INFERENCE=0 \
python eval_kit/run_generation.py \
--eval_set eval_kit/eval_set_small.json \
--agent memory_agent.agent.controller:UpdateReflectionAgent \
--output experiments/results/predictions_update_reflection_small.json \
--resume
MEMORY_ALLOW_INFERENCE=1 is an optional prompt ablation that lets the model
make cautious "likely/probably" inferences from indirect memories. Current
small-set results show fewer unknown answers but slightly lower rough F1, so
the main reported p10 configuration keeps it disabled.
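The MEMORY_* variables above can be thought of as one typed configuration object. The sketch below parses them that way, under two assumptions that are not confirmed by the repository: the field defaults simply reuse the quick-run values from this README, and flag variables are treated as enabled only when set to `1`. `MemorySettings` and `load_memory_settings` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class MemorySettings:
    # Defaults copy the quick-run values in this README (an assumption,
    # not necessarily the agent's built-in defaults).
    per_session: int = 4
    top_k: int = 10
    detail_notes: bool = True
    detail_notes_per_session: int = 3
    update_use_llm: bool = False
    reflections: int = 6
    reflection_inputs: int = 120
    allow_inference: bool = False


def load_memory_settings(env: Optional[Mapping[str, str]] = None) -> MemorySettings:
    """Parse MEMORY_* variables into a typed object (hypothetical helper)."""
    env = os.environ if env is None else env
    base = MemorySettings()

    def as_int(name: str, default: int) -> int:
        return int(env.get(name, default))

    def as_flag(name: str, default: bool) -> bool:
        # Assumed convention: "1" enables the flag, anything else disables it.
        return env.get(name, "1" if default else "0").strip() == "1"

    return MemorySettings(
        per_session=as_int("MEMORY_PER_SESSION", base.per_session),
        top_k=as_int("MEMORY_TOP_K", base.top_k),
        detail_notes=as_flag("MEMORY_DETAIL_NOTES", base.detail_notes),
        detail_notes_per_session=as_int(
            "MEMORY_DETAIL_NOTES_PER_SESSION", base.detail_notes_per_session
        ),
        update_use_llm=as_flag("MEMORY_UPDATE_USE_LLM", base.update_use_llm),
        reflections=as_int("MEMORY_REFLECTIONS", base.reflections),
        reflection_inputs=as_int("MEMORY_REFLECTION_INPUTS", base.reflection_inputs),
        allow_inference=as_flag("MEMORY_ALLOW_INFERENCE", base.allow_inference),
    )


if __name__ == "__main__":
    print(load_memory_settings())
```

Collecting the settings in one frozen object makes an ablation such as MEMORY_ALLOW_INFERENCE=1 visible as a single changed field when comparing runs.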
For the p10 experiment used in the current result summary:
bash experiments/run_p10_experiments.sh
The generated p10 files include:
experiments/results/predictions_nomem_p10.json
experiments/results/predictions_fullctx_p10.json
experiments/results/predictions_rag_p10.json
experiments/results/predictions_update_reflection_p10.json
Local judge results with qwen2.5:3b are included as development references.
For official-style judging, configure a stronger OpenAI-compatible endpoint
such as DeepSeek or DashScope:
export LLM_BASE_URL="https://api.deepseek.com/v1"
export LLM_API_KEY="sk-..."
export LLM_MODEL="deepseek-v4-flash"
bash experiments/run_p10_cloud_judge.sh
The current experiment record is:
experiments/results/small_experiment_summary.md
The machine-readable result manifest is:
experiments/results/experiment_manifest.json