olmo_tap.final_evals.elo.run_tournament
Tournament orchestrator: response cache → judges → Elo → artifacts.
Wires together the existing per-entrant response cache, the LLM-judge batch pipeline, and the local permutation-averaged Elo engine. Two run modes are supported:
- pilot: 50-prompt stratified subset judged by Sonnet, no prompt-cache pre-warming. Used as a cheap sanity gate before the headline run.
- headline: the full 143-prompt bank judged by Opus, with a 1-query pre-warm batch per rubric so Anthropic’s prompt cache is populated before the bulk batch goes out.
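The two modes differ only in judge model, prompt count, and pre-warm policy. A minimal sketch of that difference, using hypothetical names (these are not the module's actual config keys; the values restate the description above):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeSettings:
    """Illustrative only; restates the pilot/headline description above."""
    judge_model: str      # Sonnet for pilot, Opus for headline
    n_prompts: int        # 50-prompt stratified subset vs. the full 143-prompt bank
    prewarm_cache: bool   # headline sends a 1-query pre-warm batch per rubric


MODES = {
    "pilot": ModeSettings(judge_model="sonnet", n_prompts=50, prewarm_cache=False),
    "headline": ModeSettings(judge_model="opus", n_prompts=143, prewarm_cache=True),
}
```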
Output directory layout:
runs/<mode>_<timestamp>/
├── manifest.json # mode, model, prompt count, seeds, timestamps
├── matches/matches.jsonl # one record per (dimension, pair)
├── verdicts/{factuality,calibration,clinical_utility}.jsonl
├── elo_results.json # mean ± SEM per entrant per dimension
├── elo_per_perm.npz # full (n_perms × entrants) traces
└── pairwise_winrates.csv # per-pair win/loss/tie counts
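A quick way to inspect a finished run is to read the manifest back. The snippet below assumes the JSON keys mirror the keyword arguments of write_manifest() further down (an assumption, not a documented schema), and the run directory name is a placeholder:

```python
import json
from pathlib import Path

run_dir = Path("runs/pilot_20250101T120000")  # hypothetical run directory name
manifest = json.loads((run_dir / "manifest.json").read_text())

# Key names are assumed to mirror the keyword arguments of write_manifest() below.
print(manifest["mode"], manifest["judge_model"], manifest["n_perms"])
```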
Usage:
pixi run -e default python -m olmo_tap.final_evals.elo.run_tournament \
--config olmo_tap/final_evals/elo/configs/tournament1.yaml \
--mode pilot
Functions
| Function | Description |
| --- | --- |
| judgments_to_matches(judgments) | Drop ties / inconsistent verdicts; return (a, b, winner) triples. |
| load_response_cache(cache_dir, entrants) | Load every per-entrant responses_<entrant>.jsonl into a flat dict. |
| pairwise_winrates(judgments) | Aggregate per-pair counts. |
| parse_args(argv) | Parse command-line arguments. |
| print_summary_table(primary, n_matches) | Print per-dimension Elo results and match counts. |
| write_elo_per_perm(out_dir, primary) | Persist the full (n_perms × entrants) rating array per dimension. |
| write_elo_results(out_dir, ...) | Write elo_results.json (mean ± SEM per entrant per dimension). |
| write_manifest(out_dir, ...) | Write manifest.json for the run. |
| write_matches(out_dir, matches_by_dim) | Write matches/matches.jsonl, one record per (dimension, pair). |
- olmo_tap.final_evals.elo.run_tournament.judgments_to_matches(judgments: Iterable[Judgment]) → list[tuple[str, str, str | None]]
  Drop ties / inconsistent verdicts; return (a, b, winner) triples.
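A usage sketch of the returned shape; the triples and entrant ids below are invented, but they match the documented return type:

```python
from collections import Counter

# judgments_to_matches(judgments) returns (a, b, winner) triples like these;
# the entrant ids here are hypothetical.
matches = [
    ("olmo_sft", "olmo_dpo", "olmo_dpo"),
    ("olmo_sft", "baseline", "olmo_sft"),
]

# Count wins per entrant, skipping any triple without a decided winner.
wins = Counter(winner for _, _, winner in matches if winner is not None)
print(wins.most_common())
```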
- olmo_tap.final_evals.elo.run_tournament.load_response_cache(cache_dir: Path, entrants: Iterable[str]) → dict[tuple[str, str], GeneratedResponse]
  Load every per-entrant responses_<entrant>.jsonl into a flat dict. Missing per-entrant files are reported and skipped; downstream match_builder.build_match_list() decides whether to abort by counting how many (entrant, prompt) cells are present.
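A hedged usage sketch: the cache directory and entrant ids are placeholders, and the key order (entrant first, then prompt id) is inferred from the "(entrant, prompt) cells" wording above rather than read from the code:

```python
from pathlib import Path

from olmo_tap.final_evals.elo.run_tournament import load_response_cache

entrants = ["olmo_sft", "olmo_dpo"]                              # hypothetical entrant ids
cache = load_response_cache(Path("cache/responses"), entrants)   # placeholder cache path

# Count how many (entrant, prompt) cells each entrant actually has.
coverage = {e: sum(1 for key in cache if key[0] == e) for e in entrants}
print(coverage)
```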
- olmo_tap.final_evals.elo.run_tournament.pairwise_winrates(judgments: Iterable[Judgment]) → dict[tuple[str, str], dict[str, int]]
  Aggregate per-pair counts. (a, b) keys are sorted lexicographically so (A, B) and (B, A) collapse onto one row.
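The sorted-key convention can be illustrated without the real Judgment type; the entrant ids, verdicts, and count names below are placeholders, not the function's actual output schema:

```python
from collections import defaultdict

counts: dict[tuple[str, str], dict[str, int]] = defaultdict(
    lambda: {"first_wins": 0, "second_wins": 0, "ties": 0}
)

# Two verdicts on the same pair, seen in opposite orders.
for a, b, winner in [("olmo_dpo", "baseline", "olmo_dpo"), ("baseline", "olmo_dpo", None)]:
    key = (a, b) if a <= b else (b, a)   # (A, B) and (B, A) collapse onto one row
    if winner is None:
        counts[key]["ties"] += 1
    elif winner == key[0]:
        counts[key]["first_wins"] += 1
    else:
        counts[key]["second_wins"] += 1

print(dict(counts))   # a single ("baseline", "olmo_dpo") row holds both records
```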
- olmo_tap.final_evals.elo.run_tournament.parse_args(argv: list[str] | None = None) → Namespace
- olmo_tap.final_evals.elo.run_tournament.print_summary_table(primary: dict[Literal['factuality', 'calibration', 'clinical_utility'], dict[str, EloResult]], n_matches: dict[Literal['factuality', 'calibration', 'clinical_utility'], int]) → None
- olmo_tap.final_evals.elo.run_tournament.write_elo_per_perm(out_dir: Path, primary: dict[Literal['factuality', 'calibration', 'clinical_utility'], dict[str, EloResult]]) → None
  Persist the full (n_perms × entrants) rating array per dimension. Saved as a single .npz archive with two keys per dimension: <dim>__data (float64, shape (n_perms, n_entrants)) and <dim>__entrants (string array of column ids).
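To read the traces back, something like the following works; the run path and the choice of the factuality dimension are placeholders, while the key naming follows the <dim>__data / <dim>__entrants scheme described above:

```python
import numpy as np

archive = np.load("runs/headline_20250101T120000/elo_per_perm.npz")  # placeholder path

ratings = archive["factuality__data"]        # float64, shape (n_perms, n_entrants)
entrants = archive["factuality__entrants"]   # string array of column ids

# Mean rating per entrant across permutations, matching the per-entrant means in elo_results.json.
for name, mean in zip(entrants, ratings.mean(axis=0)):
    print(name, float(mean))
```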
- olmo_tap.final_evals.elo.run_tournament.write_elo_results(out_dir: Path, *, primary: dict[Literal['factuality', 'calibration', 'clinical_utility'], dict[str, EloResult]], sweep: dict[Literal['factuality', 'calibration', 'clinical_utility'], dict[float, dict[str, EloResult]]], n_matches: dict[Literal['factuality', 'calibration', 'clinical_utility'], int], default_k: float) → None
- olmo_tap.final_evals.elo.run_tournament.write_manifest(out_dir: Path, *, mode: Literal['pilot', 'headline'], config_path: Path, judge_model: str, entrants: list[str], bank_size: int, selected_size: int, n_perms: int, seed: int, pilot_seed: int, started_at: str, finished_at: str | None) → None
- olmo_tap.final_evals.elo.run_tournament.write_matches(out_dir: Path, matches_by_dim: dict[Literal['factuality', 'calibration', 'clinical_utility'], list[PairToJudge]]) → None