olmo_tap.final_evals.elo.scripts.validate_judge
End-to-end validation of the judge pipeline against a hand-crafted pair.
Builds one (response_a, response_b) pair where response_a is clearly
better on every dimension (a coherent medical answer vs. a nonsense
reply), runs judge_pairs against all three rubrics with
Sonnet 4.6, and asserts that the verdict is A for each. Reasoning
traces and cache stats are printed so a human can eyeball the calls.
Run with:
pixi run -e default python -m olmo_tap.final_evals.elo.scripts.validate_judge
The script uses Sonnet because it is cheap; Opus is reserved for the headline tournament run. Each invocation costs about $0.01.
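The shape of the check described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the real script calls judge_pairs against Sonnet 4.6, whose signature is not shown here, so the sketch swaps in a hypothetical `stub_judge` that makes no API call. The `Verdict` dataclass, the `validate` helper, the rubric names, and the two example responses are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical rubric names; the real pipeline defines its own three rubrics.
RUBRICS = ["rubric_1", "rubric_2", "rubric_3"]


@dataclass
class Verdict:
    winner: str     # "A" or "B"
    reasoning: str  # reasoning trace, printed for manual inspection


def stub_judge(response_a: str, response_b: str, rubric: str) -> Verdict:
    """Stand-in for the real judge call (assumption): prefers the longer response."""
    winner = "A" if len(response_a) >= len(response_b) else "B"
    return Verdict(winner, f"[{rubric}] picked {winner} by length (stub heuristic)")


def validate(judge: Callable[[str, str, str], Verdict]) -> dict[str, Verdict]:
    """Judge one hand-crafted pair on every rubric and assert that A wins each time."""
    # response_a is a coherent medical answer; response_b is deliberate nonsense.
    response_a = (
        "Persistent chest pain with shortness of breath can indicate a cardiac "
        "emergency; call emergency services rather than waiting it out."
    )
    response_b = "purple monkey dishwasher"

    results: dict[str, Verdict] = {}
    for rubric in RUBRICS:
        verdict = judge(response_a, response_b, rubric)
        print(verdict.reasoning)  # let a human eyeball each call
        assert verdict.winner == "A", f"{rubric}: expected A, got {verdict.winner}"
        results[rubric] = verdict
    return results


if __name__ == "__main__":
    validate(stub_judge)
    print("all rubrics returned A")
```

Passing the judge in as a callable keeps the assertion logic separate from the model call, so the same loop can be exercised offline with a stub and then pointed at the real judge.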
Functions