Seems like the MCTSr authors did use ground truth information in the MCTS refinement process.
They use the LLM for determining the rewards, but the search terminates when the output is equal to the GT.
That said, a similar method could still be used as an RL environment to train agents.
Nothing surprising tbh, I tried it 2 days ago, before all the noise.
The thing is that it's literally brute-forcing answers out of the LLM.
What if we don't know the ground truth?
It will fail miserably at those tasks.
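To make the point concrete, here's a toy sketch of a refinement loop with and without the ground-truth stop. This is not the MCTSr code: `refine`, `make_proposer`, `imperfect_reward`, and the canned answers are all made up for illustration, and the "LLM" is just a fixed list of guesses with a self-reward that happens to prefer a wrong answer.

```python
def refine(propose, self_reward, ground_truth=None, max_rollouts=4):
    """Toy refinement loop: track the best self-scored candidate.

    If ground_truth is given, stop as soon as a candidate matches it
    (the label leak being discussed). Returns (answer, rollouts_used).
    """
    best, best_score = None, float("-inf")
    for rollout in range(1, max_rollouts + 1):
        candidate = propose(best)
        score = self_reward(candidate)
        if score > best_score:
            best, best_score = candidate, score
        # Early exit that is only possible when the label is known.
        if ground_truth is not None and candidate == ground_truth:
            return candidate, rollout
    return best, max_rollouts


def make_proposer(answers):
    """Hypothetical stand-in for the LLM: replays a fixed list of guesses."""
    it = iter(answers)
    return lambda _previous: next(it)


# Hypothetical self-reward that is confidently wrong about "43".
SCORES = {"41": 0.2, "42": 0.5, "43": 0.9, "44": 0.1}
GUESSES = ["41", "42", "43", "44"]


def imperfect_reward(answer):
    return SCORES[answer]


with_gt = refine(make_proposer(GUESSES), imperfect_reward, ground_truth="42")
without_gt = refine(make_proposer(GUESSES), imperfect_reward)
print(with_gt)     # ('42', 2)
print(without_gt)  # ('43', 4)
```

With the label available the loop just keeps sampling until it hits the right answer. Without it, the result is only as good as the self-reward, which here picks the wrong one.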
Another difference is that RLHF doesn't do proper exploration: it mostly learns to exploit a subset of the pretraining trajectories.
In contrast, when doing proper RL, exploration over the discrete action distribution is usually encouraged by adding an entropy term to the loss function.
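For reference, a minimal sketch of what that entropy term looks like for a single discrete action, assuming a plain policy-gradient loss with an entropy bonus in the style of A2C/PPO-type objectives. The function names and the `beta` coefficient are placeholders, not anything from a specific library.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def pg_loss_with_entropy(logits, action, advantage, beta=0.01):
    """Policy-gradient loss for one sample, minus an entropy bonus.

    Subtracting beta * entropy means minimizing the loss pushes the
    policy toward higher entropy, i.e. away from collapsing onto a
    single action too early.
    """
    probs = softmax(logits)
    policy_loss = -advantage * math.log(probs[action])
    return policy_loss - beta * entropy(probs)


uniform = softmax([0.0, 0.0, 0.0, 0.0])
peaked = softmax([5.0, 0.0, 0.0, 0.0])
print(entropy(uniform) > entropy(peaked))  # True
```

The bonus is largest for the uniform policy and shrinks as the policy becomes peaked, so `beta` trades off exploiting the current best action against keeping other actions alive.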