olmo_tap.final_evals.elo.elo_engine

Permutation-averaged Elo with K-factor sensitivity sweep.

Implements the robustness recipe from Boubdir et al. (2023), "Elo Uncovered: Robustness and Best Practices in Language Model Evaluation":

  • Initial rating R_0 = 1400 for every entrant.

  • Update rule R'_A = R_A + K * (S_A - E_A) with E_A = 1 / (1 + 10^((R_B - R_A) / 400)); a worked example follows this list.

  • Ties are dropped from the match list (consistent with the paper’s handling of inconsistent judge verdicts).

  • The match list is shuffled n_perms times; per-permutation Elo is computed independently and we report mean ± SEM across permutations rather than a single-pass score.

  • A K-factor sweep returns the same per-entrant statistics over a list of K values, enabling the sensitivity heatmap (Figure 3 of the paper).
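For concreteness, here is the update rule worked through once in plain Python (a minimal sketch; expected_score is an illustrative helper name, not part of the module's API):

    import math

    def expected_score(r_a: float, r_b: float) -> float:
        # E_A = 1 / (1 + 10^((R_B - R_A) / 400))
        return 1.0 / (1.0 + math.pow(10.0, (r_b - r_a) / 400.0))

    # Two entrants at the initial rating R_0 = 1400 have E_A = 0.5.
    # If A wins (S_A = 1) with K = 16, A gains K * (1 - 0.5) = 8 points:
    r_a = r_b = 1400.0
    k = 16.0
    e_a = expected_score(r_a, r_b)           # 0.5
    r_a_new = r_a + k * (1.0 - e_a)          # 1408.0
    r_b_new = r_b + k * (0.0 - (1.0 - e_a))  # 1392.0, since E_B = 1 - E_A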

The input is a flat list of (entrant_a, entrant_b, winner) triples where winner is either one of the entrant ids or None / "TIE" to mark a tie. See compute_elo_permutation() for the full contract.

Implementation notes:

  • All shuffling uses numpy.random.default_rng(seed) and the seed is surfaced on the result for reproducibility / logging.

  • The hot loop is written in plain Python with math.pow rather than NumPy: matches are processed serially within each permutation, so vectorising would yield only a marginal speedup while obscuring the update rule. (A sketch of one permutation's pass follows this list.)

  • Ratings are stored in a plain dict keyed by entrant id so the engine is agnostic to entrant ordering and to the entrant set.
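A minimal sketch of one permutation's pass, reflecting the notes above (run_one_permutation and its internals are illustrative names, not the module's actual private functions):

    import math
    import numpy as np

    def run_one_permutation(matches, k=16.0, initial_rating=1400.0, rng=None):
        # matches: list of (entrant_a, entrant_b, winner) with ties already dropped.
        rng = rng if rng is not None else np.random.default_rng(0)
        ratings: dict[str, float] = {}
        for idx in rng.permutation(len(matches)):
            a, b, winner = matches[idx]
            r_a = ratings.setdefault(a, initial_rating)
            r_b = ratings.setdefault(b, initial_rating)
            # E_A per the formula above; plain math.pow, per the notes.
            e_a = 1.0 / (1.0 + math.pow(10.0, (r_b - r_a) / 400.0))
            s_a = 1.0 if winner == a else 0.0
            ratings[a] = r_a + k * (s_a - e_a)
            ratings[b] = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
        return ratings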

Functions

compute_elo_permutation(matches, *[, k, ...])

Compute permutation-averaged Elo ratings.

k_factor_sweep(matches, *[, k_values, ...])

Run compute_elo_permutation() across a list of K-factors.

rank_entrants(results)

Return [(entrant_id, mean_rating), ...] sorted high-to-low.

Classes

EloResult(entrant_id, mean, sem, ci95_low, ...)

Per-entrant rating statistics across permutations.

class olmo_tap.final_evals.elo.elo_engine.EloResult(entrant_id: str, mean: float, sem: float, ci95_low: float, ci95_high: float, per_perm_ratings: ndarray)[source]

Bases: object

Per-entrant rating statistics across permutations.

entrant_id

The entrant the statistics are for.

Type: str

mean

Mean Elo rating across the permutations.

Type: float

sem

Standard error of the mean (std(ddof=1) / sqrt(n_perms)).

Type: float

ci95_low

mean - 1.96 * sem.

Type: float

ci95_high

mean + 1.96 * sem.

Type: float

per_perm_ratings

1-D array of per-permutation final ratings; the full trace is returned so that downstream reports can plot distributions, not just summary statistics.

Type: numpy.ndarray

olmo_tap.final_evals.elo.elo_engine.compute_elo_permutation(matches: Iterable[tuple[str, str, str | None]], *, k: float = 16.0, initial_rating: float = 1400.0, n_perms: int = 500, seed: int = 0) → dict[str, EloResult][source]

Compute permutation-averaged Elo ratings.

The match list is shuffled n_perms times with numpy.random.default_rng(seed); for each permutation the Elo update is run sequentially and the final rating per entrant is recorded. The returned dict reports mean ± SEM and a 95% CI per entrant alongside the full per-permutation trace.

Parameters:
  • matches – Iterable of (entrant_a, entrant_b, winner) triples. winner may be None or "TIE" to mark a tie; ties are dropped before the permutation loop.

  • k – Elo K-factor.

  • initial_rating – Starting rating for every entrant on every permutation. Boubdir et al. (2023) fix this at 1400.

  • n_perms – Number of independent permutations to run. The paper recommends >= 100; the headline run uses 500.

  • seed – Seed for the NumPy Generator driving the shuffles. Logged on the result so runs are reproducible.

Returns:

Mapping entrant_id -> EloResult for every entrant referenced by matches.

Raises:

ValueError – If n_perms < 1, if matches is empty, or if any match has a winner that is neither one of the two entrants nor a recognised tie marker.
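Usage sketch (the entrant ids are made up for illustration; the printed values depend on the match data):

    from olmo_tap.final_evals.elo.elo_engine import compute_elo_permutation

    matches = [
        ("model_a", "model_b", "model_a"),
        ("model_a", "model_c", "model_c"),
        ("model_b", "model_c", None),  # tie: dropped before the permutation loop
        ("model_a", "model_b", "model_a"),
    ]

    results = compute_elo_permutation(matches, k=16.0, n_perms=500, seed=0)
    for entrant_id, res in results.items():
        print(f"{entrant_id}: {res.mean:.1f} ± {res.sem:.1f} "
              f"(95% CI [{res.ci95_low:.1f}, {res.ci95_high:.1f}])")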

olmo_tap.final_evals.elo.elo_engine.k_factor_sweep(matches: Iterable[tuple[str, str, str | None]], *, k_values: Sequence[int | float] = (1, 4, 8, 16, 32), initial_rating: float = 1400.0, n_perms: int = 500, seed: int = 0) → dict[float, dict[str, EloResult]][source]

Run compute_elo_permutation() across a list of K-factors.

Returns a heatmap-ready {k: {entrant_id: EloResult}} structure. The same shuffle seed is used for every K so the difference across the sweep reflects the K-factor’s effect rather than shuffle variance.

Parameters:
  • matches – Same contract as compute_elo_permutation().

  • k_values – K-factors to sweep. The default (1, 4, 8, 16, 32) covers the range where Boubdir et al. (2023) observed ranking changes across their benchmark suites.

  • initial_rating – Starting rating per entrant per permutation.

  • n_perms – Permutations per K-factor.

  • seed – Seed for every K’s permutation loop.

Returns:

Mapping k -> {entrant_id: EloResult}. K-factors are stored as float keys for stable hashing across int / float inputs.
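Continuing the usage sketch from compute_elo_permutation() above:

    from olmo_tap.final_evals.elo.elo_engine import k_factor_sweep

    sweep = k_factor_sweep(matches, k_values=(1, 4, 8, 16, 32), n_perms=500, seed=0)
    # Heatmap-ready: one row per K-factor, one column per entrant.
    for k, per_entrant in sweep.items():
        print(k, {eid: round(res.mean) for eid, res in per_entrant.items()})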

olmo_tap.final_evals.elo.elo_engine.rank_entrants(results: dict[str, EloResult]) → list[tuple[str, float]][source]

Return [(entrant_id, mean_rating), ...] sorted high-to-low.
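The sort key is the mean rating only, so the result is equivalent to this sketch (entrants with identical means, if any, may be ordered differently):

    ranking = sorted(
        ((eid, res.mean) for eid, res in results.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )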