olmo_tap.final_evals.elo.scripts.validate_judge

End-to-end validation of the judge pipeline against a hand-crafted pair.

Builds one (response_a, response_b) pair where response_a is clearly better on every dimension (a coherent medical answer vs. a nonsense reply), runs judge_pairs against all three rubrics with Sonnet 4.6, and asserts that the verdict is A for each. Reasoning traces and cache stats are printed so a human can eyeball the calls.
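The check described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the real judge_pairs signature, the rubric names, and the verdict type are not shown on this page, so RUBRICS, Verdict, stub_judge, and validate are all hypothetical stand-ins, and the stub replaces the model call with a trivial length heuristic. The sketch also returns an exit code rather than raising on a wrong verdict, which is an assumption based on main() returning an int.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical placeholders: the three real rubric names are not given here.
RUBRICS = ("rubric_1", "rubric_2", "rubric_3")


@dataclass
class Verdict:
    """Hypothetical verdict shape: the winning side plus a reasoning trace."""
    winner: str     # "A" or "B"
    reasoning: str  # free-text explanation from the judge


def stub_judge(response_a: str, response_b: str, rubric: str) -> Verdict:
    """Stand-in for the model-backed judge; simply prefers the longer reply."""
    winner = "A" if len(response_a) >= len(response_b) else "B"
    return Verdict(winner, f"[{rubric}] chose {winner} (length heuristic)")


def validate(judge: Callable[[str, str, str], Verdict]) -> int:
    """Judge one hand-crafted pair under every rubric; return 0 if A always wins."""
    # response_a is a coherent answer, response_b is deliberate nonsense.
    response_a = (
        "For a mild tension headache, rest, hydration, and an over-the-counter "
        "analgesic taken as directed are usually enough; see a doctor if it persists."
    )
    response_b = "purple monkey dishwasher"

    failures = 0
    for rubric in RUBRICS:
        verdict = judge(response_a, response_b, rubric)
        print(verdict.reasoning)  # printed so a human can eyeball each call
        if verdict.winner != "A":
            failures += 1
    return 0 if failures == 0 else 1
```

With the stub, validate(stub_judge) returns 0 because the coherent answer is longer than the nonsense reply; any judge that picks B for some rubric makes it return 1.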

Run with:

pixi run -e default python -m olmo_tap.final_evals.elo.scripts.validate_judge

The script must use Sonnet, the cheaper model; Opus is reserved for the tournament's headline run. Each invocation costs about $0.01.

Functions

main()

olmo_tap.final_evals.elo.scripts.validate_judge.main() → int