olmo_tap.final_evals.elo.elo_engine
Permutation-averaged Elo with K-factor sensitivity sweep.
Implements the robustness recipe from Boubdir et al. (2023), Elo Uncovered: Robustness and Best Practices in Language Model Evaluation:
- Initial rating R_0 = 1400 for every entrant.
- Update rule R'_A = R_A + K * (S_A - E_A) with E_A = 1 / (1 + 10^((R_B - R_A) / 400)).
- Ties are dropped from the match list (consistent with the paper’s handling of inconsistent judge verdicts).
- The match list is shuffled n_perms times; per-permutation Elo is computed independently and we report mean ± SEM across permutations rather than a single-pass score.
- A K-factor sweep returns the same per-entrant statistics over a list of K values, enabling the sensitivity heatmap (Figure 3 of the paper).
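The update rule above can be illustrated with a minimal sketch. The helper names `expected_score` and `elo_update` are hypothetical and are not part of this module; the symmetric update for entrant B is the standard zero-sum Elo convention and is assumed here rather than taken from the source.

```python
import math


def expected_score(r_a: float, r_b: float) -> float:
    """E_A = 1 / (1 + 10^((R_B - R_A) / 400))."""
    return 1.0 / (1.0 + math.pow(10.0, (r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, s_a: float, k: float = 16.0) -> tuple[float, float]:
    """Apply R' = R + K * (S - E) to both entrants (hypothetical helper).

    s_a is 1.0 if A won and 0.0 if B won; ties are assumed to have been
    dropped before this point.
    """
    e_a = expected_score(r_a, r_b)
    delta = k * (s_a - e_a)
    # Assumed zero-sum convention: B's change is the negative of A's.
    return r_a + delta, r_b - delta


# Two entrants at the initial rating R_0 = 1400; A wins with K = 16.
new_a, new_b = elo_update(1400.0, 1400.0, s_a=1.0, k=16.0)
print(new_a, new_b)  # 1408.0 1392.0
```

With equal ratings the expected score is 0.5, so the winner gains exactly K / 2 points.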
The input is a flat list of (entrant_a, entrant_b, winner) triples where
winner is either one of the entrant ids or None / "TIE" to mark a
tie. See compute_elo_permutation() for the full contract.
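As an illustration of the input contract, the sketch below builds a small match list and filters out ties. The entrant ids and the helper `drop_ties` are made up for this example; how the module itself validates and filters matches is not shown in this page, so treat this as one plausible reading of the contract.

```python
from typing import Iterable, Optional

Match = tuple[str, str, Optional[str]]

# Tie markers as described above: None or the string "TIE".
TIE_MARKERS = (None, "TIE")


def drop_ties(matches: Iterable[Match]) -> list[Match]:
    """Return only decisive matches (hypothetical helper, not the module's API)."""
    decisive = []
    for entrant_a, entrant_b, winner in matches:
        if winner in TIE_MARKERS:
            continue  # ties carry no rating update and are dropped
        if winner not in (entrant_a, entrant_b):
            raise ValueError(f"winner {winner!r} is not one of the two entrants")
        decisive.append((entrant_a, entrant_b, winner))
    return decisive


matches = [
    ("model_x", "model_y", "model_x"),
    ("model_y", "model_z", None),      # tie, dropped
    ("model_x", "model_z", "TIE"),     # tie, dropped
    ("model_z", "model_y", "model_y"),
]
decisive = drop_ties(matches)
print(len(decisive))  # 2
```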
Implementation notes:
- All shuffling uses numpy.random.default_rng(seed) and the seed is surfaced on the result for reproducibility / logging.
- The hot loop is written with simple Python and math.pow rather than NumPy because matches are processed serially per permutation; the speedup from vectorising would be marginal and would obscure the update rule.
- Ratings are stored in a plain dict keyed by entrant id so the engine is agnostic to entrant ordering and to the entrant set.
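The reproducibility note can be checked with a short sketch. It assumes that a single Generator is created from the seed and consumed across all permutations; the helper `shuffled_orders` is hypothetical and only demonstrates that identical seeds yield identical shuffle orders.

```python
import numpy as np


def shuffled_orders(n_matches: int, n_perms: int, seed: int) -> list[np.ndarray]:
    """Draw n_perms index permutations from one seeded Generator (illustrative)."""
    rng = np.random.default_rng(seed)
    return [rng.permutation(n_matches) for _ in range(n_perms)]


first = shuffled_orders(n_matches=6, n_perms=3, seed=0)
second = shuffled_orders(n_matches=6, n_perms=3, seed=0)

# Same seed, same sequence of shuffles.
same = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same)  # True
```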
Functions
compute_elo_permutation() | Compute permutation-averaged Elo ratings. |
k_factor_sweep() | Run compute_elo_permutation() across a list of K-factors. |
|
Return |
Classes
EloResult | Per-entrant rating statistics across permutations. |
- class olmo_tap.final_evals.elo.elo_engine.EloResult(entrant_id: str, mean: float, sem: float, ci95_low: float, ci95_high: float, per_perm_ratings: ndarray)
Bases: object
Per-entrant rating statistics across permutations.
- per_perm_ratings
1-D array of per-permutation final ratings; the full trace is returned so that downstream reports can plot distributions, not just summary statistics.
- Type:
numpy.ndarray
- per_perm_ratings: ndarray
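A sketch of how the scalar fields could be derived from a per-permutation trace follows. The class `RatingSummary` and the function `summarise` are hypothetical stand-ins, not the module's code. Two assumptions are made that the source does not confirm: the SEM uses the sample standard deviation (n - 1 denominator), and the 95% interval is the normal approximation mean ± 1.96 × SEM.

```python
import math
import statistics
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RatingSummary:
    """Hypothetical stand-in for the scalar fields of EloResult."""
    mean: float
    sem: float
    ci95_low: float
    ci95_high: float


def summarise(per_perm_ratings: Sequence[float]) -> RatingSummary:
    n = len(per_perm_ratings)
    mean = statistics.fmean(per_perm_ratings)
    # Assumption: sample standard deviation; undefined for n == 1, so use 0.0.
    sd = statistics.stdev(per_perm_ratings) if n > 1 else 0.0
    sem = sd / math.sqrt(n)
    # Assumption: normal-approximation interval, not a percentile interval.
    half_width = 1.96 * sem
    return RatingSummary(mean, sem, mean - half_width, mean + half_width)


summary = summarise([1410.0, 1390.0, 1400.0, 1405.0, 1395.0])
print(round(summary.mean, 1), round(summary.sem, 3))
```

Keeping the full trace alongside these summaries means a different interval (for example a percentile interval) can be computed later without rerunning the permutations.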
- olmo_tap.final_evals.elo.elo_engine.compute_elo_permutation(matches: Iterable[tuple[str, str, str | None]], *, k: float = 16.0, initial_rating: float = 1400.0, n_perms: int = 500, seed: int = 0) → dict[str, EloResult]
Compute permutation-averaged Elo ratings.
The match list is shuffled n_perms times with numpy.random.default_rng(seed); for each permutation the Elo update is run sequentially and the final rating per entrant is recorded. The returned dict reports mean ± SEM and 95% CI per entrant alongside the full per-permutation trace.
- Parameters:
matches – Iterable of (entrant_a, entrant_b, winner) triples. winner may be None or "TIE" to mark a tie; ties are dropped before the permutation loop.
k – Elo K-factor.
initial_rating – Starting rating for every entrant on every permutation. Boubdir et al. (2023) fix this at 1400.
n_perms – Number of independent permutations to run. The paper recommends >= 100; the headline run uses 500.
seed – Seed for the NumPy Generator driving the shuffles. Logged on the result so runs are reproducible.
- Returns:
Mapping entrant_id -> EloResult for every entrant referenced by matches.
- Raises:
ValueError – If n_perms < 1, if matches is empty, or if any match has a winner that is neither one of the two entrants nor a recognised tie marker.
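The procedure described for compute_elo_permutation() can be sketched end to end as below. This is not the module's source: the function name `sketch_permutation_elo` is hypothetical, it returns raw per-permutation traces rather than EloResult objects, and the choices to apply a zero-sum update and to keep tie-only entrants at the initial rating are assumptions.

```python
import math

import numpy as np


def sketch_permutation_elo(matches, *, k=16.0, initial_rating=1400.0,
                           n_perms=500, seed=0):
    """Return {entrant_id: 1-D array of final ratings, one per permutation}."""
    if n_perms < 1:
        raise ValueError("n_perms must be >= 1")
    all_matches = list(matches)
    if not all_matches:
        raise ValueError("matches is empty")

    entrants = sorted({e for a, b, _ in all_matches for e in (a, b)})
    decisive = []
    for a, b, winner in all_matches:
        if winner is None or winner == "TIE":
            continue  # ties are dropped before the permutation loop
        if winner not in (a, b):
            raise ValueError(f"invalid winner {winner!r} for match ({a!r}, {b!r})")
        decisive.append((a, b, winner))

    rng = np.random.default_rng(seed)
    traces = {e: np.empty(n_perms) for e in entrants}
    for p in range(n_perms):
        # Every permutation restarts from the same initial rating.
        ratings = {e: initial_rating for e in entrants}
        for i in rng.permutation(len(decisive)):
            a, b, winner = decisive[i]
            e_a = 1.0 / (1.0 + math.pow(10.0, (ratings[b] - ratings[a]) / 400.0))
            delta = k * ((1.0 if winner == a else 0.0) - e_a)
            ratings[a] += delta
            ratings[b] -= delta  # assumed zero-sum update for the opponent
        for e in entrants:
            traces[e][p] = ratings[e]
    return traces


demo = [("a", "b", "a"), ("b", "c", "b"), ("a", "c", "a"),
        ("c", "a", "c"), ("a", "b", None)]
traces = sketch_permutation_elo(demo, n_perms=50, seed=0)
for entrant, trace in traces.items():
    print(entrant, round(float(trace.mean()), 1))
```

Because each permutation starts from scratch, the spread of a trace reflects only the sensitivity of sequential Elo to match order.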
- olmo_tap.final_evals.elo.elo_engine.k_factor_sweep(matches: Iterable[tuple[str, str, str | None]], *, k_values: Sequence[int | float] = (1, 4, 8, 16, 32), initial_rating: float = 1400.0, n_perms: int = 500, seed: int = 0) → dict[float, dict[str, EloResult]]
Run compute_elo_permutation() across a list of K-factors.
Returns a heatmap-ready {k: {entrant_id: EloResult}} structure. The same shuffle seed is used for every K so the difference across the sweep reflects the K-factor’s effect rather than shuffle variance.
- Parameters:
matches – Same contract as compute_elo_permutation().
k_values – K-factors to sweep. The default (1, 4, 8, 16, 32) covers the range where Boubdir et al. (2023) observed ranking changes on their benchmark suites.
initial_rating – Starting rating per entrant per permutation.
n_perms – Permutations per K-factor.
seed – Seed for every K’s permutation loop.
- Returns:
Mapping k -> {entrant_id: EloResult}. K-factors are stored as float keys for stable hashing across int / float inputs.
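The sweep pattern (same seed for every K, float-normalised keys) can be sketched with a pluggable rater. Both `sweep` and `single_pass_elo` are hypothetical names; the stand-in rater performs one seeded shuffle and one Elo pass using the standard library's random module, so it is deliberately simpler than the permutation-averaged engine documented here and assumes ties were already removed.

```python
import math
import random


def single_pass_elo(matches, *, k, seed, initial_rating=1400.0):
    """Stand-in rater: one seeded shuffle, one Elo pass (not permutation-averaged)."""
    order = list(matches)
    random.Random(seed).shuffle(order)
    ratings = {}
    for a, b, winner in order:
        r_a = ratings.setdefault(a, initial_rating)
        r_b = ratings.setdefault(b, initial_rating)
        e_a = 1.0 / (1.0 + math.pow(10.0, (r_b - r_a) / 400.0))
        delta = k * ((1.0 if winner == a else 0.0) - e_a)
        ratings[a], ratings[b] = r_a + delta, r_b - delta
    return ratings


def sweep(matches, rate_fn, *, k_values=(1, 4, 8, 16, 32), seed=0):
    """Run rate_fn once per K with the same seed; keys are normalised to float."""
    matches = list(matches)  # materialise once so an iterator is not exhausted
    return {float(k): rate_fn(matches, k=float(k), seed=seed) for k in k_values}


matches = [("a", "b", "a"), ("b", "c", "b"), ("a", "c", "a")]
table = sweep(matches, single_pass_elo)
print(sorted(table))  # [1.0, 4.0, 8.0, 16.0, 32.0]
```

Since the shuffle order is identical for every K, any difference between rows of the resulting table comes from the K-factor alone.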