When I tried to reproduce this project on 4×H200 GPUs, training stopped unexpectedly after four or five hours. I tried many times, and the alfworld training completed successfully only once. Is this a known problem, and how can I solve it?