bigict · chungongyu · May 22, 2026
diff --git a/README_CDS_5UTR.md b/README_CDS_5UTR.md
@@ -0,0 +1,188 @@
+# Evo2 CDS→5UTR Fine-Tuning Project Setup
+
+This project fine-tunes the Evo2 model to generate 5' UTR sequences, using the CDS (coding sequence) as the prompt.
+
+## File structure
+
+```
+evo2/
+├── configs/
+│   └── cds_5utr_finetune_config.yaml    # main configuration file
+├── scripts/
+│   ├── prepare_cds_5utr_data.py         # data preparation script
+│   ├── run_cds_5utr_finetune.sh         # training automation script
+│   └── generate_5utr.py                 # 5' UTR generation script
+└── README_CDS_5UTR.md                   # this file
+```
+
+## Quick start
+
+### Step 1: Prepare the data
+
+```bash
+# Extract CDS and 5' UTR sequences from a genome FASTA and a GTF annotation
+python scripts/prepare_cds_5utr_data.py \
+    --genome your_genome.fasta \
+    --annotation your_annotation.gtf \
+    --output cds_5utr_training.fasta \
+    --upstream_length 500
+```
+
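+**Input requirements:**
+- Genome FASTA file
+- GTF annotation file (with CDS and mRNA/transcript features)
+
+**Output:**
+- Training FASTA file (CDS and 5' UTR concatenated in each record)
+- Sequence length statistics (to help choose `seq_length`)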
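
The concatenated record format can be pictured with a minimal sketch. It assumes each record is simply the CDS followed directly by the 5' UTR under one FASTA header; the function name, header layout, and line width are hypothetical and are not taken from `prepare_cds_5utr_data.py`.

```python
def make_training_record(record_id: str, cds: str, utr5: str, width: int = 60) -> str:
    """Return one FASTA record whose sequence is the CDS followed by the 5' UTR.

    Assumption: plain concatenation with no separator token.
    """
    seq = (cds + utr5).upper()
    # Reject anything that is not a nucleotide code we expect in training data.
    invalid = set(seq) - set("ACGTN")
    if invalid:
        raise ValueError(f"unexpected characters in {record_id}: {sorted(invalid)}")
    # Wrap the sequence to a fixed line width, as is conventional for FASTA.
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{record_id}\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_training_record("gene1", "ATGAAA", "ccg", width=4), end="")
```

With this ordering the CDS acts as the prompt and the model learns to continue it with the 5' UTR.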
+
+### Step 2: Configure training parameters
+
+Edit `configs/cds_5utr_finetune_config.yaml`:
+
+```yaml
+preprocess:
+  input_path: "/path/to/cds_5utr_training.fasta"  # change to your own path
+  output_prefix: "/path/to/preprocessed_data"
+  seq_length: 8192  # adjust based on the statistics from step 1
+```
+
+### Step 3: Run training
+
+```bash
+# Recommended: LoRA fine-tuning (saves GPU memory)
+./scripts/run_cds_5utr_finetune.sh lora
+
+# Other options:
+# ./scripts/run_cds_5utr_finetune.sh single   # full-parameter fine-tuning on one GPU
+# ./scripts/run_cds_5utr_finetune.sh 2gpu     # full-parameter fine-tuning on two GPUs
+# ./scripts/run_cds_5utr_finetune.sh long     # long-sequence training with 1M context
+```
+
+### Step 4: Generate 5' UTRs
+
+```bash
+# Test with a single sequence
+python scripts/generate_5utr.py \
+    --model models/cds_5utr_model.pt \
+    --cds "ATGCGT..." \
+    --output test_5utr.txt
+
+# Batch generation
+python scripts/generate_5utr.py \
+    --model models/cds_5utr_model.pt \
+    --input cds_sequences.fasta \
+    --output generated_5utr.fasta \
+    --n_tokens 500 \
+    --temperature 1.0
+```
+
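+## Hardware requirements
+
+| Training mode | GPU | GPU memory | Recommendation |
+|---------------|-----|------------|----------------|
+| LoRA fine-tuning | 1x H100/A100 | 40GB+ | ✅ Recommended for getting started |
+| LoRA fine-tuning | 2x H100/A100 | 80GB | ✅ Best value |
+| Full-parameter fine-tuning | 2x H100/A100 | 80GB | ⚠️ Requires tensor parallelism |
+| 1M context | 2x H100/A100 | 80GB | ⚠️ Requires FP8 |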
+
+## Tuning key parameters
+
+### Choosing seq_length
+
+Based on the statistics printed by `prepare_cds_5utr_data.py`:
+
+```
+Recommended seq_length settings:
+  Cover 95% of data: 4096 (rounded up to a power of 2: 4096)
+  Cover 99% of data: 8192 (rounded up to a power of 2: 8192)
+```
+
+- If 95% of the combined CDS + 5' UTR lengths are below 4096, use `seq_length: 4096`
+- To cover longer sequences, use `seq_length: 8192` or higher
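
The recommendation above amounts to taking a length percentile and rounding it up to a power of two. The sketch below shows one way to compute that; it uses a nearest-rank percentile, and the function names are hypothetical. The actual script may compute its statistics differently.

```python
import math


def percentile_length(lengths, pct):
    """Nearest-rank percentile: the smallest length that covers pct% of sequences."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def next_power_of_two(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << max(0, (n - 1).bit_length())


def recommend_seq_length(lengths, pct=95.0):
    """Round the pct-th percentile of CDS + 5' UTR lengths up to a power of two."""
    return next_power_of_two(percentile_length(lengths, pct))


if __name__ == "__main__":
    # Toy length distribution: 95 short records and 5 long ones.
    lengths = [3000] * 95 + [7000] * 5
    print("95%:", recommend_seq_length(lengths, 95))
    print("99%:", recommend_seq_length(lengths, 99))
```

Records longer than the chosen `seq_length` would not fit in one training window, so the percentile is a trade-off between coverage and memory use.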
+
+### LoRA parameters
+
+```yaml
+lora_dim: 16        # values to try: 8, 16, 32, 64 (larger means more trainable parameters)
+lora_alpha: 32      # typically 2x lora_dim
+lora_dropout: 0.1   # helps prevent overfitting
+```
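
To see how `lora_dim` and `lora_alpha` interact, here is a small sketch based on the standard LoRA formulation, in which a frozen `d_out x d_in` weight gets a low-rank update scaled by `lora_alpha / lora_dim`. Whether this project's training code applies exactly that scaling is an assumption, and the 4096 x 4096 layer size is purely illustrative.

```python
def lora_scaling(lora_dim: int, lora_alpha: float) -> float:
    """Scale applied to the low-rank update in standard LoRA: alpha / r."""
    return lora_alpha / lora_dim


def lora_trainable_params(d_in: int, d_out: int, lora_dim: int) -> int:
    """Parameters in the two adapter matrices A (r x d_in) and B (d_out x r)."""
    return lora_dim * (d_in + d_out)


if __name__ == "__main__":
    d_in = d_out = 4096  # hypothetical layer size, not taken from Evo2
    full = d_in * d_out
    for r in (8, 16, 32, 64):
        extra = lora_trainable_params(d_in, d_out, r)
        print(f"lora_dim={r:>2}: {extra} adapter params "
              f"({extra / full:.2%} of the full matrix), "
              f"scale={lora_scaling(r, 2 * r)}")
```

Keeping `lora_alpha` at twice `lora_dim` holds the scale at 2.0 while the adapter size changes, which is one reason the two are usually tuned together.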
+
+### Generation parameters
+
+```bash
+--temperature 1.0   # 0.5-1.5: lower is more conservative, higher is more random
+--top_k 4           # 2-10: controls sampling diversity
+--n_tokens 500      # adjust to the target 5' UTR length
+```
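
The effect of `--temperature` and `--top_k` can be illustrated with a generic sampling sketch over a toy nucleotide vocabulary. This is not the code in `generate_5utr.py`; the function name and the example logits are hypothetical.

```python
import math
import random


def sample_top_k(logits, temperature=1.0, top_k=4, rng=None):
    """Sample one token: keep the top_k highest logits, then apply temperature softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    rng = rng or random.Random()
    # Keep only the top_k most likely tokens.
    kept = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Dividing by the temperature sharpens (<1) or flattens (>1) the distribution.
    scaled = [value / temperature for _, value in kept]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # numerically stable softmax weights
    tokens = [token for token, _ in kept]
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = {"A": 2.0, "C": 0.5, "G": 0.1, "T": -1.0}  # made-up values
    rng = random.Random(0)
    print("".join(sample_top_k(toy_logits, temperature=0.7, top_k=2, rng=rng)
                  for _ in range(20)))
```

With `top_k=1` the sampler always returns the most likely token, and raising either parameter increases diversity at the cost of less conservative output.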
+
+## Environment setup
+
+### Use the BioNemo Docker image (recommended)
+
+```bash
+docker run --rm -it \
+  --gpus=all --ipc=host \
+  -v /path/to/evo2:/workspace/evo2 \
+  nvcr.io/nvidia/clara/bionemo-framework:nightly \
+  /bin/bash
+
+cd /workspace/evo2
+```
+
+### Or use a local environment
+
+```bash
+# Install dependencies
+pip install -e .
+
+# Install the BioNemo tooling
+git clone https://github.com/NVIDIA/bionemo-framework.git
+cd bionemo-framework
+./.ci_build.sh
+source ./.ci_test_env.sh
+```
+
+## Troubleshooting
+
+### Out of memory (OOM)
+
+1. Reduce `micro_batch_size`
+2. Use LoRA fine-tuning instead of full-parameter fine-tuning
+3. Shorten `seq_length`
+4. Enable gradient accumulation (`gradient_accumulation_steps`)
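
Lowering `micro_batch_size` and raising `gradient_accumulation_steps` can keep the effective batch size unchanged, as the arithmetic sketch below shows. The `data_parallel_size` argument is an assumed extra factor for multi-GPU runs and is not a parameter named in this README.

```python
def effective_batch_size(micro_batch_size: int,
                         gradient_accumulation_steps: int,
                         data_parallel_size: int = 1) -> int:
    """Sequences contributing to each optimizer step."""
    return micro_batch_size * gradient_accumulation_steps * data_parallel_size


if __name__ == "__main__":
    # Halving the micro batch while doubling accumulation keeps the same
    # effective batch, with roughly half the activation memory per step.
    print(effective_batch_size(micro_batch_size=4, gradient_accumulation_steps=4))
    print(effective_batch_size(micro_batch_size=2, gradient_accumulation_steps=8))
```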
+
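+### Training does not converge
+
+1. Check that the data format is correct
+2. Increase `max_steps`
+3. Adjust the learning rate (the default usually works)
+4. Make sure the CDS and 5' UTR sequences are of good quality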
+
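+### Poor generation quality
+
+1. Train for more steps
+2. Adjust `temperature` (try the 0.7-1.2 range)
+3. Check the amount and quality of the training data
+4. Try full-parameter fine-tuning instead of LoRA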
+
+## Related resources
+
+- [BioNemo framework documentation](https://docs.nvidia.com/bionemo-framework/latest/)
+- [Evo2 GitHub](https://github.com/evo-design/evo2)
+- [Savanna training framework](https://github.com/Zymrael/savanna)
+
+## Citation
+
+If you use this project, please cite:
+
+```bibtex
+@article {king2025,
+   author = {King, Samuel H and Driscoll, Claudia L and Li, David B and Guo, Daniel and Merchant, Aditi T and Brixi, Garyk and Wilkinson, Max E and Hie, Brian L},
+   title = {Generative design of novel bacteriophages with genome language models},
+   year = {2025},
+   doi = {10.1101/2025.09.12.675911},
+   publisher = {Cold Spring Harbor Laboratory}
+}
+```