olmo_tap.final_evals.elo.run_tournament

Tournament orchestrator: response cache → judges → Elo → artifacts.

Wires together the existing per-entrant response cache, the LLM-judge batch pipeline, and the local permutation-averaged Elo engine. Two run modes are supported:

  • pilot: 50-prompt stratified subset judged by Sonnet, no prompt-cache pre-warming. Used as a cheap sanity gate before the headline run.

  • headline: the full 143-prompt bank judged by Opus, with a 1-query pre-warm batch per rubric so Anthropic’s prompt cache is populated before the bulk batch goes out.
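A minimal sketch of that pre-warm step, assuming the Anthropic Python SDK's prompt-caching API; the function and rubric names here are illustrative, not the module's own:

import anthropic

client = anthropic.Anthropic()

def prewarm_prompt_cache(judge_model: str, rubric_texts: dict[str, str]) -> None:
    # One throwaway query per rubric: the long judge prompt is sent once
    # with cache_control so the later bulk batch hits the prompt cache.
    for rubric_text in rubric_texts.values():
        client.messages.create(
            model=judge_model,
            max_tokens=16,
            system=[{
                "type": "text",
                "text": rubric_text,
                "cache_control": {"type": "ephemeral"},
            }],
            messages=[{"role": "user", "content": "ping"}],
        )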

Output directory layout:

runs/<mode>_<timestamp>/
├── manifest.json         # mode, model, prompt count, seeds, timestamps
├── matches/matches.jsonl # one record per (dimension, pair)
├── verdicts/{factuality,calibration,clinical_utility}.jsonl
├── elo_results.json      # mean ± SEM per entrant per dimension
├── elo_per_perm.npz      # full (n_perms × entrants) traces
└── pairwise_winrates.csv # per-pair win/loss/tie counts
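Downstream analysis can load the artifacts back along these lines (a sketch; the run-directory name is a placeholder):

import json
from pathlib import Path

run_dir = Path("runs/pilot_20250101T000000")  # placeholder run directory

manifest = json.loads((run_dir / "manifest.json").read_text())
elo = json.loads((run_dir / "elo_results.json").read_text())
verdicts = [
    json.loads(line)
    for line in (run_dir / "verdicts" / "factuality.jsonl").read_text().splitlines()
]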

Usage:

pixi run -e default python -m olmo_tap.final_evals.elo.run_tournament \
    --config olmo_tap/final_evals/elo/configs/tournament1.yaml \
    --mode pilot

Functions

judgments_to_matches(judgments)

Drop ties / inconsistent verdicts; return (a, b, winner) triples.

load_config(path)

Load the YAML tournament config into a dict.

load_response_cache(cache_dir, entrants)

Load every per-entrant responses_<entrant>.jsonl into a flat dict.

main([argv])

CLI entry point; runs the selected mode end to end.

pairwise_winrates(judgments)

Aggregate per-pair counts.

parse_args([argv])

Parse command-line arguments.

print_summary_table(primary, n_matches)

Print per-dimension Elo ratings and match counts.

write_elo_per_perm(out_dir, primary)

Persist the full (n_perms × entrants) rating array per dimension.

write_elo_results(out_dir, *, primary, ...)

Write elo_results.json: primary ratings, the K-factor sweep, and match counts.

write_manifest(out_dir, *, mode, ...)

Write manifest.json recording the run configuration.

write_matches(out_dir, matches_by_dim)

Write matches/matches.jsonl, one record per (dimension, pair).

write_pairwise_winrates_csv(out_dir, by_dim)

Write pairwise_winrates.csv with per-pair win/loss/tie counts.

write_verdicts(out_dir, dimension, judgments)

Write the judgments for one dimension to verdicts/<dimension>.jsonl.

olmo_tap.final_evals.elo.run_tournament.judgments_to_matches(judgments: Iterable[Judgment]) → list[tuple[str, str, str | None]]

Drop ties / inconsistent verdicts; return (a, b, winner) triples.
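A minimal sketch of the filtering, assuming a Judgment exposes entrant_a, entrant_b, and a winner that is None for ties and inconsistent verdicts (the real field names may differ):

def judgments_to_matches(judgments):
    matches: list[tuple[str, str, str | None]] = []
    for j in judgments:
        if j.winner is None:  # tie or inconsistent verdict: dropped
            continue
        matches.append((j.entrant_a, j.entrant_b, j.winner))
    return matches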

olmo_tap.final_evals.elo.run_tournament.load_config(path: Path) → dict[str, Any]

Load the YAML tournament config into a dict.

olmo_tap.final_evals.elo.run_tournament.load_response_cache(cache_dir: Path, entrants: Iterable[str]) → dict[tuple[str, str], GeneratedResponse]

Load every per-entrant responses_<entrant>.jsonl into a flat dict.

Missing per-entrant files are reported and skipped; downstream match_builder.build_match_list() decides whether to abort by counting how many (entrant, prompt) cells are present.
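A sketch of the loading shape, assuming each JSONL record carries a prompt id (the field name and the raw-dict values are assumptions; the real function parses records into GeneratedResponse):

import json
from pathlib import Path

def load_response_cache(cache_dir: Path, entrants: list[str]) -> dict[tuple[str, str], dict]:
    cache: dict[tuple[str, str], dict] = {}
    for entrant in entrants:
        path = cache_dir / f"responses_{entrant}.jsonl"
        if not path.exists():
            print(f"missing response file, skipping: {path}")
            continue
        for line in path.read_text().splitlines():
            record = json.loads(line)
            # keyed by (entrant, prompt) so match building can count cells
            cache[(entrant, record["prompt_id"])] = record
    return cache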

olmo_tap.final_evals.elo.run_tournament.main(argv: list[str] | None = None) → None

CLI entry point; runs the selected mode end to end.

olmo_tap.final_evals.elo.run_tournament.pairwise_winrates(judgments: Iterable[Judgment]) → dict[tuple[str, str], dict[str, int]]

Aggregate per-pair counts. (a, b) keys are sorted lexicographically so (A, B) and (B, A) collapse onto one row.
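A sketch of the aggregation, assuming the same Judgment fields as above and illustrative count keys:

from collections import defaultdict

def pairwise_winrates(judgments):
    counts = defaultdict(lambda: {"a_wins": 0, "b_wins": 0, "ties": 0})
    for j in judgments:
        a, b = sorted((j.entrant_a, j.entrant_b))  # collapse (A, B) / (B, A)
        row = counts[(a, b)]
        if j.winner is None:
            row["ties"] += 1
        elif j.winner == a:
            row["a_wins"] += 1
        else:
            row["b_wins"] += 1
    return dict(counts)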

olmo_tap.final_evals.elo.run_tournament.parse_args(argv: list[str] | None = None) → Namespace

Parse command-line arguments.

olmo_tap.final_evals.elo.run_tournament.print_summary_table(primary: dict[Literal['factuality', 'calibration', 'clinical_utility'], dict[str, EloResult]], n_matches: dict[Literal['factuality', 'calibration', 'clinical_utility'], int]) → None

Print per-dimension Elo ratings and match counts.

olmo_tap.final_evals.elo.run_tournament.write_elo_per_perm(out_dir: Path, primary: dict[Literal['factuality', 'calibration', 'clinical_utility'], dict[str, EloResult]]) → None

Persist the full (n_perms × entrants) rating array per dimension.

Saved as a single .npz archive with two keys per dimension: <dim>__data (float64, shape (n_perms, n_entrants)) and <dim>__entrants (string array of column ids).
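That key layout makes the archive straightforward to read back; for example (out_dir is a placeholder run directory):

from pathlib import Path

import numpy as np

out_dir = Path("runs/headline_20250101T000000")  # placeholder
arc = np.load(out_dir / "elo_per_perm.npz")
for dim in ("factuality", "calibration", "clinical_utility"):
    data = arc[f"{dim}__data"]          # float64, shape (n_perms, n_entrants)
    entrants = arc[f"{dim}__entrants"]  # column ids, in column order
    print(dim, dict(zip(entrants.tolist(), data.mean(axis=0))))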

olmo_tap.final_evals.elo.run_tournament.write_elo_results(out_dir: Path, *, primary: dict[Literal['factuality', 'calibration', 'clinical_utility'], dict[str, EloResult]], sweep: dict[Literal['factuality', 'calibration', 'clinical_utility'], dict[float, dict[str, EloResult]]], n_matches: dict[Literal['factuality', 'calibration', 'clinical_utility'], int], default_k: float) → None

Write elo_results.json: primary ratings, the K-factor sweep, and match counts.

olmo_tap.final_evals.elo.run_tournament.write_manifest(out_dir: Path, *, mode: Literal['pilot', 'headline'], config_path: Path, judge_model: str, entrants: list[str], bank_size: int, selected_size: int, n_perms: int, seed: int, pilot_seed: int, started_at: str, finished_at: str | None) → None

Write manifest.json recording the run configuration.

olmo_tap.final_evals.elo.run_tournament.write_matches(out_dir: Path, matches_by_dim: dict[Literal['factuality', 'calibration', 'clinical_utility'], list[PairToJudge]]) → None

Write matches/matches.jsonl, one record per (dimension, pair).

olmo_tap.final_evals.elo.run_tournament.write_pairwise_winrates_csv(out_dir: Path, by_dim: dict[Literal['factuality', 'calibration', 'clinical_utility'], dict[tuple[str, str], dict[str, int]]]) → None

Write pairwise_winrates.csv with per-pair win/loss/tie counts.

olmo_tap.final_evals.elo.run_tournament.write_verdicts(out_dir: Path, dimension: Literal['factuality', 'calibration', 'clinical_utility'], judgments: list[Judgment]) → None

Write the judgments for one dimension to verdicts/<dimension>.jsonl.