olmo_tap.final_evals.elo.report
Tournament reporting helpers.
Stubbed for now. Will produce:
- elo_results.json: mean / SEM / 95% CI per entrant per dimension, plus the full per-permutation traces.
- sensitivity_heatmap.png: K × entrant heatmap mirroring the ranking-stability figure from Boubdir et al. (2023).
- pairwise_winrates.csv: raw win/loss/tie counts per pair per dimension (pre-Elo).
- judge_log.jsonl: every judge query with full inputs, verdict, reasoning trace, and cache key.
- run_manifest.json: timestamps, seeds, model versions, prompt-set hash, and rubric version (so reviewers can verify reproducibility).
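Because the module is still stubbed, the summary statistics listed for elo_results.json can only be sketched. The example below assumes each entrant has one Elo rating per permutation and that the 95% CI is a normal approximation (mean ± 1.96 × SEM); the function name `summarise_traces`, the key names, the dimension name "helpfulness", and the sample ratings are all hypothetical and not taken from the module.

```python
import json
import math
import statistics


def summarise_traces(traces: dict[str, list[float]]) -> dict[str, dict]:
    """Summarise per-permutation Elo ratings for each entrant (hypothetical helper)."""
    summary = {}
    for entrant, ratings in traces.items():
        mean = statistics.fmean(ratings)
        # SEM = sample standard deviation / sqrt(n); needs at least two permutations.
        sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
        # Assumption: normal-approximation interval; the real report may use another method.
        half_width = 1.96 * sem
        summary[entrant] = {
            "mean": mean,
            "sem": sem,
            "ci95": [mean - half_width, mean + half_width],
            "trace": list(ratings),  # full per-permutation trace
        }
    return summary


# Made-up ratings from four permutations, for illustration only.
example = {
    "model_a": [1510.0, 1490.0, 1505.0, 1495.0],
    "model_b": [1480.0, 1520.0, 1500.0, 1500.0],
}
payload = {"helpfulness": summarise_traces(example)}
text = json.dumps(payload, indent=2)
```

Keeping the raw trace next to the summary lets a reader recompute the mean, SEM, and interval from the file alone.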
Functions
render_sensitivity_heatmap: Render the K × entrant heatmap (one PNG per dimension).
Serialise per-dimension Elo results to
Snapshot every reproducibility-relevant input for the report.
- olmo_tap.final_evals.elo.report.render_sensitivity_heatmap(sweep: Mapping[float, dict[str, EloResult]], out_path: Path, *, dimension: str) -> None
Render the K × entrant heatmap (one PNG per dimension).
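Since the function is a stub, the following is only a rough sketch of what such a renderer might look like, not the module's implementation. It assumes every K value in `sweep` covers the same entrants and that each cell shows the entrant's mean rating (the real figure may plot ranks instead). `EloResultStub` is a hypothetical stand-in, because the fields of the real `EloResult` are not documented here; `build_grid` and `render_heatmap_sketch` are likewise illustrative names. Matplotlib is imported lazily and is an assumed dependency.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass
class EloResultStub:
    """Hypothetical stand-in; the real EloResult fields are not shown in this page."""
    mean: float


def build_grid(sweep: Mapping[float, dict[str, EloResultStub]]):
    """Return (k_values, entrants, grid): one row per K, one column per entrant."""
    k_values = sorted(sweep)
    entrants = sorted({name for results in sweep.values() for name in results})
    # Assumption: every K value has a result for every entrant (else KeyError).
    grid = [[sweep[k][name].mean for name in entrants] for k in k_values]
    return k_values, entrants, grid


def render_heatmap_sketch(sweep, out_path: Path, *, dimension: str) -> None:
    """Write one PNG for a single dimension (sketch only)."""
    import matplotlib
    matplotlib.use("Agg")  # headless backend, no display needed
    import matplotlib.pyplot as plt

    k_values, entrants, grid = build_grid(sweep)
    fig, ax = plt.subplots()
    image = ax.imshow(grid, aspect="auto")
    ax.set_xticks(range(len(entrants)))
    ax.set_xticklabels(entrants, rotation=45, ha="right")
    ax.set_yticks(range(len(k_values)))
    ax.set_yticklabels([str(k) for k in k_values])
    ax.set_xlabel("entrant")
    ax.set_ylabel("K")
    ax.set_title(f"Elo sensitivity: {dimension}")
    fig.colorbar(image, ax=ax, label="mean Elo")
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)


# Made-up sweep over two K values, for illustration only.
sweep = {
    32.0: {"model_a": EloResultStub(1510.0), "model_b": EloResultStub(1490.0)},
    16.0: {"model_a": EloResultStub(1505.0), "model_b": EloResultStub(1495.0)},
}
k_values, entrants, grid = build_grid(sweep)
```

With Matplotlib installed, `render_heatmap_sketch(sweep, Path("sensitivity_heatmap.png"), dimension="helpfulness")` would write the figure; the dimension name is again only an example. Separating the grid construction from the plotting keeps the data layout testable without a plotting backend.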