olmo_tap.final_evals.elo.run_tournament
Tournament orchestrator: response cache → judges → Elo → artifacts.
Wires together the existing per-entrant response cache, the LLM-judge batch pipeline, and the local permutation-averaged Elo engine. Two run modes are supported:
- pilot: 50-prompt stratified subset judged by Sonnet, no prompt-cache pre-warming. Used as a cheap sanity gate before the headline run.
- headline: the full 143-prompt bank judged by Opus, with a 1-query pre-warm batch per rubric so Anthropic’s prompt cache is populated before the bulk batch goes out.
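The two modes differ only in judge model, prompt count, and pre-warm policy. A minimal sketch of that difference, using hypothetical names (these are not the module's actual config keys; the values restate the description above):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeSettings:
    """Illustrative only; restates the pilot/headline description above."""
    judge_model: str      # Sonnet for pilot, Opus for headline
    n_prompts: int        # 50-prompt stratified subset vs. the full 143-prompt bank
    prewarm_cache: bool   # headline sends a 1-query pre-warm batch per rubric


MODES = {
    "pilot": ModeSettings(judge_model="sonnet", n_prompts=50, prewarm_cache=False),
    "headline": ModeSettings(judge_model="opus", n_prompts=143, prewarm_cache=True),
}
```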
Output directory layout:
runs/<mode>_<timestamp>/
├── manifest.json # mode, model, prompt count, seeds, timestamps
├── matches/matches.jsonl # one record per (dimension, pair)
├── verdicts/{factuality,calibration,clinical_utility}.jsonl
├── elo_results.json # mean ± SEM per entrant per dimension
├── elo_per_perm.npz # full (n_perms × entrants) traces
└── pairwise_winrates.csv # per-pair win/loss/tie counts
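A quick way to inspect a finished run is to read the manifest back. The snippet below assumes the JSON keys mirror the keyword arguments of write_manifest() further down (an assumption, not a documented schema), and the run directory name is a placeholder:

```python
import json
from pathlib import Path

run_dir = Path("runs/pilot_20250101T120000")  # hypothetical run directory name
manifest = json.loads((run_dir / "manifest.json").read_text())

# Key names are assumed to mirror the keyword arguments of write_manifest() below.
print(manifest["mode"], manifest["judge_model"], manifest["n_perms"])
```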
Usage:
pixi run -e default python -m olmo_tap.final_evals.elo.run_tournament \
--config olmo_tap/final_evals/elo/configs/tournament1.yaml \
--mode pilot
Functions
| Function | Description |
| --- | --- |
| judgments_to_matches(judgments) | Drop ties / inconsistent verdicts; return (a, b, winner) triples. |
| load_response_cache(cache_dir, entrants) | Load every per-entrant responses_<entrant>.jsonl into a flat dict. |
| pairwise_winrates(judgments) | Aggregate per-pair counts. |
| parse_args(argv) | Parse command-line arguments. |
| print_summary_table(primary, n_matches) | Print per-dimension Elo results and match counts. |
| write_elo_per_perm(out_dir, primary) | Persist the full (n_perms × entrants) rating array per dimension. |
| write_elo_results(out_dir, ...) | Write elo_results.json (mean ± SEM per entrant per dimension). |
| write_manifest(out_dir, ...) | Write manifest.json for the run. |
| write_matches(out_dir, matches_by_dim) | Write matches/matches.jsonl, one record per (dimension, pair). |
- olmo_tap.final_evals.elo.run_tournament.judgments_to_matches(judgments: Iterable[Judgment]) → list[tuple[str, str, str | None]]
  Drop ties / inconsistent verdicts; return (a, b, winner) triples.
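A usage sketch of the returned shape; the triples and entrant ids below are invented, but they match the documented return type:

```python
from collections import Counter

# judgments_to_matches(judgments) returns (a, b, winner) triples like these;
# the entrant ids here are hypothetical.
matches = [
    ("olmo_sft", "olmo_dpo", "olmo_dpo"),
    ("olmo_sft", "baseline", "olmo_sft"),
]

# Count wins per entrant, skipping any triple without a decided winner.
wins = Counter(winner for _, _, winner in matches if winner is not None)
print(wins.most_common())
```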
- olmo_tap.final_evals.elo.run_tournament.load_response_cache(cache_dir: Path, entrants: Iterable[str]) → dict[tuple[str, str], GeneratedResponse]
  Load every per-entrant responses_<entrant>.jsonl into a flat dict. Missing per-entrant files are reported and skipped; downstream match_builder.build_match_list() decides whether to abort by counting how many (entrant, prompt) cells are present.
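A hedged usage sketch: the cache directory and entrant ids are placeholders, and the key order (entrant first, then prompt id) is inferred from the "(entrant, prompt) cells" wording above rather than read from the code:

```python
from pathlib import Path

from olmo_tap.final_evals.elo.run_tournament import load_response_cache

entrants = ["olmo_sft", "olmo_dpo"]                              # hypothetical entrant ids
cache = load_response_cache(Path("cache/responses"), entrants)   # placeholder cache path

# Count how many (entrant, prompt) cells each entrant actually has.
coverage = {e: sum(1 for key in cache if key[0] == e) for e in entrants}
print(coverage)
```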
- olmo_tap.final_evals.elo.run_tournament.pairwise_winrates(judgments: Iterable[Judgment]) → dict[tuple[str, str], dict[str, int]]
  Aggregate per-pair counts. (a, b) keys are sorted lexicographically so (A, B) and (B, A) collapse onto one row.
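The sorted-key convention can be illustrated without the real Judgment type; the entrant ids, verdicts, and count names below are placeholders, not the function's actual output schema:

```python
from collections import defaultdict

counts: dict[tuple[str, str], dict[str, int]] = defaultdict(
    lambda: {"first_wins": 0, "second_wins": 0, "ties": 0}
)

# Two verdicts on the same pair, seen in opposite orders.
for a, b, winner in [("olmo_dpo", "baseline", "olmo_dpo"), ("baseline", "olmo_dpo", None)]:
    key = (a, b) if a <= b else (b, a)   # (A, B) and (B, A) collapse onto one row
    if winner is None:
        counts[key]["ties"] += 1
    elif winner == key[0]:
        counts[key]["first_wins"] += 1
    else:
        counts[key]["second_wins"] += 1

print(dict(counts))   # a single ("baseline", "olmo_dpo") row holds both records
```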
- olmo_tap.final_evals.elo.run_tournament.parse_args(argv: list[str] | None = None) → Namespace
- olmo_tap.final_evals.elo.run_tournament.print_summary_table(primary: dict[Literal['factuality', 'calibration', 'clinical_utility'], dict[str, EloResult]], n_matches: dict[Literal['factuality', 'calibration', 'clinical_utility'], int]) → None
- olmo_tap.final_evals.elo.run_tournament.write_elo_per_perm(out_dir: Path, primary: dict[Literal['factuality', 'calibration', 'clinical_utility'], dict[str, EloResult]]) → None
  Persist the full (n_perms × entrants) rating array per dimension. Saved as a single .npz archive with two keys per dimension: <dim>__data (float64, shape (n_perms, n_entrants)) and <dim>__entrants (string array of column ids).
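To read the traces back, something like the following works; the run path and the choice of the factuality dimension are placeholders, while the key naming follows the <dim>__data / <dim>__entrants scheme described above:

```python
import numpy as np

archive = np.load("runs/headline_20250101T120000/elo_per_perm.npz")  # placeholder path

ratings = archive["factuality__data"]        # float64, shape (n_perms, n_entrants)
entrants = archive["factuality__entrants"]   # string array of column ids

# Mean rating per entrant across permutations, matching the per-entrant means in elo_results.json.
for name, mean in zip(entrants, ratings.mean(axis=0)):
    print(name, float(mean))
```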
- olmo_tap.final_evals.elo.run_tournament.write_elo_results(out_dir: Path, *, primary: dict[Literal['factuality', 'calibration', 'clinical_utility'], dict[str, EloResult]], sweep: dict[Literal['factuality', 'calibration', 'clinical_utility'], dict[float, dict[str, EloResult]]], n_matches: dict[Literal['factuality', 'calibration', 'clinical_utility'], int], default_k: float) → None
- olmo_tap.final_evals.elo.run_tournament.write_manifest(out_dir: Path, *, mode: Literal['pilot', 'headline'], config_path: Path, judge_model: str, entrants: list[str], bank_size: int, selected_size: int, n_perms: int, seed: int, pilot_seed: int, started_at: str, finished_at: str | None) → None
- olmo_tap.final_evals.elo.run_tournament.write_matches(out_dir: Path, matches_by_dim: dict[Literal['factuality', 'calibration', 'clinical_utility'], list[PairToJudge]]) → None