olmo_tap.final_evals.elo.report

Tournament reporting helpers.

Stubbed for now. Once implemented, this module will produce:

  • elo_results.json — mean / SEM / 95% CI per entrant per dimension, plus the full per-permutation traces.

  • sensitivity_heatmap.png — K × entrant heatmap mirroring the ranking-stability figure from Boubdir et al. (2023).

  • pairwise_winrates.csv — raw win/loss/tie counts per pair per dimension (pre-Elo).

  • judge_log.jsonl — every judge query with full inputs, verdict, reasoning trace, and cache key.

  • run_manifest.json — timestamps, seeds, model versions, prompt-set hash, rubric version (so reviewers can verify reproducibility).

Functions

render_sensitivity_heatmap(sweep, out_path, ...)

Render the K × entrant heatmap (one PNG per dimension).

write_results_json(results_per_dim, out_path)

Serialise per-dimension Elo results to out_path.

write_run_manifest(config, out_path)

Snapshot every reproducibility-relevant input for the report.

olmo_tap.final_evals.elo.report.render_sensitivity_heatmap(sweep: Mapping[float, dict[str, EloResult]], out_path: Path, *, dimension: str) → None

Render the K × entrant heatmap (one PNG per dimension).
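Since the function is stubbed, here is a minimal sketch of one way it could render the sweep, assuming a hypothetical `EloResult` with a `mean` field (the real dataclass, colour map, and styling will differ):

```python
from dataclasses import dataclass

import matplotlib

matplotlib.use("Agg")  # headless rendering, so this also runs on CI
import matplotlib.pyplot as plt
import numpy as np


@dataclass
class EloResult:  # hypothetical stand-in for the project's EloResult
    mean: float


def render_sensitivity_heatmap(sweep, out_path, *, dimension):
    """Render a K x entrant heatmap of mean Elo for one dimension."""
    ks = sorted(sweep)
    entrants = sorted(sweep[ks[0]])
    # Rows: K values of the sweep; columns: entrants.
    matrix = np.array([[sweep[k][e].mean for e in entrants] for k in ks])

    fig, ax = plt.subplots()
    im = ax.imshow(matrix, aspect="auto")
    ax.set_xticks(range(len(entrants)))
    ax.set_xticklabels(entrants, rotation=45, ha="right")
    ax.set_yticks(range(len(ks)))
    ax.set_yticklabels(str(k) for k in ks)
    ax.set_xlabel("entrant")
    ax.set_ylabel("K")
    ax.set_title(f"Elo sensitivity ({dimension})")
    fig.colorbar(im, ax=ax, label="mean Elo")
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
```

One PNG is written per call, so the caller loops over dimensions and varies `out_path`.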

olmo_tap.final_evals.elo.report.write_results_json(results_per_dim: Mapping[str, dict[str, EloResult]], out_path: Path) → None

Serialise per-dimension Elo results to out_path.
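A minimal sketch of the serialisation, assuming `EloResult` is a dataclass holding the mean / SEM / 95% CI fields listed above (the field names here are hypothetical):

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Mapping


@dataclass
class EloResult:  # hypothetical stand-in for the project's EloResult
    mean: float
    sem: float
    ci_low: float
    ci_high: float


def write_results_json(results_per_dim: Mapping[str, dict[str, EloResult]],
                       out_path: Path) -> None:
    """Serialise per-dimension Elo results to out_path as nested JSON."""
    payload = {
        dim: {entrant: asdict(res) for entrant, res in entrants.items()}
        for dim, entrants in results_per_dim.items()
    }
    # sort_keys keeps the file diff-stable across runs.
    out_path.write_text(json.dumps(payload, indent=2, sort_keys=True))
```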

olmo_tap.final_evals.elo.report.write_run_manifest(config: Mapping[str, Any], out_path: Path) → None

Snapshot every reproducibility-relevant input for the report.
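A sketch of the manifest snapshot, assuming the timestamp and config-hash fields described for run_manifest.json above (the exact keys are an assumption):

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Any, Mapping


def write_run_manifest(config: Mapping[str, Any], out_path: Path) -> None:
    """Snapshot every reproducibility-relevant input for the report."""
    manifest = {
        # UTC timestamp of when the report run started.
        "created_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "config": dict(config),
        # Hash the config itself so reviewers can detect drift between runs.
        "config_sha256": hashlib.sha256(
            json.dumps(dict(config), sort_keys=True, default=str).encode()
        ).hexdigest(),
    }
    out_path.write_text(json.dumps(manifest, indent=2, default=str))
```

In practice `config` would carry the seeds, model versions, prompt-set hash, and rubric version listed above, so the manifest alone is enough to re-run the tournament.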