Interactive

The Reward Hacker

I wrote about this in “The Reward Function Heist.” The argument is simple: Reinforcement Learning from Human Feedback does not train models to be truthful. It trains them to be approved of. The gap between those two things is the entire problem, and the industry is scaling it.

Prose didn’t seem to land. So I built a demonstration instead. You are the RL agent. Make the optimal calls.
