Is your feature request related to a problem? Please describe.
Using different libraries or environments for training and generation can introduce subtle errors that are hard to debug. RL is robust enough that rewards can still look reasonable when the two disagree, so observing rewards alone is not a sufficient check.
Describe the solution you'd like
We have already added an "Adding New Models" guide, which explains the importance of tracking logprob errors. The guide should be accompanied by a script that users are expected to run whenever they bring a new model into the RL pipeline. The script should test whether the training and inference frameworks are compatible for that model.
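To make the request concrete, here is a minimal sketch of the comparison such a script might perform. It assumes per-token logprobs for the same sampled tokens are already available from both frameworks. The function names, the metrics, and the `max_mult_error` threshold are all hypothetical and are not taken from the guide.

```python
import math
from typing import Dict, Sequence


def logprob_errors(
    train_logprobs: Sequence[float], gen_logprobs: Sequence[float]
) -> Dict[str, float]:
    """Summarize disagreement between two sets of per-token logprobs.

    Both inputs are assumed to score the same tokens in the same order.
    """
    if len(train_logprobs) != len(gen_logprobs):
        raise ValueError("logprob sequences must have the same length")
    if not train_logprobs:
        raise ValueError("logprob sequences must not be empty")

    abs_diffs = [abs(t - g) for t, g in zip(train_logprobs, gen_logprobs)]
    n = len(abs_diffs)
    return {
        "max_abs_diff": max(abs_diffs),
        "mean_abs_diff": sum(abs_diffs) / n,
        # exp(|diff|) is the ratio between the two probabilities; 1.0 means
        # the frameworks agree exactly on that token.
        "mean_mult_prob_error": sum(math.exp(d) for d in abs_diffs) / n,
    }


def frameworks_compatible(
    train_logprobs: Sequence[float],
    gen_logprobs: Sequence[float],
    max_mult_error: float = 1.05,  # hypothetical threshold, tune per model
) -> bool:
    """Return True if the mean multiplicative error is within the threshold."""
    errors = logprob_errors(train_logprobs, gen_logprobs)
    return errors["mean_mult_prob_error"] <= max_mult_error
```

In an actual script, the two lists would come from generating a batch with the inference framework and then re-scoring the same tokens with the training framework. How to obtain them depends on the frameworks in use and is left out of this sketch.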