olmo_tap.final_evals.elo.elo_engine

Permutation-averaged Elo with K-factor sensitivity sweep.

Implements the robustness recipe from Boubdir et al. (2023), "Elo Uncovered: Robustness and Best Practices in Language Model Evaluation":

  • Initial rating R_0 = 1400 for every entrant.

  • Update rule R'_A = R_A + K * (S_A - E_A) with E_A = 1 / (1 + 10^((R_B - R_A) / 400)); a worked example follows this list.

  • Ties are dropped from the match list (consistent with the paper’s handling of inconsistent judge verdicts).

  • The match list is shuffled n_perms times; per-permutation Elo is computed independently and we report mean ± SEM across permutations rather than a single-pass score.

  • A K-factor sweep returns the same per-entrant statistics over a list of K values, enabling the sensitivity heatmap (Figure 3 of the paper).
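For concreteness, here is the update rule worked through once in plain Python (a minimal sketch; expected_score is an illustrative helper name, not part of the module's API):

    import math

    def expected_score(r_a: float, r_b: float) -> float:
        # E_A = 1 / (1 + 10^((R_B - R_A) / 400))
        return 1.0 / (1.0 + math.pow(10.0, (r_b - r_a) / 400.0))

    # Two entrants at the initial rating R_0 = 1400 have E_A = 0.5.
    # If A wins (S_A = 1) with K = 16, A gains K * (1 - 0.5) = 8 points:
    r_a = r_b = 1400.0
    k = 16.0
    e_a = expected_score(r_a, r_b)           # 0.5
    r_a_new = r_a + k * (1.0 - e_a)          # 1408.0
    r_b_new = r_b + k * (0.0 - (1.0 - e_a))  # 1392.0, since E_B = 1 - E_A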

The input is a flat list of (entrant_a, entrant_b, winner) triples where winner is either one of the entrant ids or None / "TIE" to mark a tie. See compute_elo_permutation() for the full contract.

Implementation notes:

  • All shuffling uses numpy.random.default_rng(seed) and the seed is surfaced on the result for reproducibility / logging.

  • The hot loop is written in plain Python with math.pow rather than NumPy: matches are processed serially within each permutation, so vectorising would yield only a marginal speedup while obscuring the update rule. (A sketch of one permutation's pass follows this list.)

  • Ratings are stored in a plain dict keyed by entrant id so the engine is agnostic to entrant ordering and to the entrant set.
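A minimal sketch of one permutation's pass, reflecting the notes above (run_one_permutation and its internals are illustrative names, not the module's actual private functions):

    import math
    import numpy as np

    def run_one_permutation(matches, k=16.0, initial_rating=1400.0, rng=None):
        # matches: list of (entrant_a, entrant_b, winner) with ties already dropped.
        rng = rng if rng is not None else np.random.default_rng(0)
        ratings: dict[str, float] = {}
        for idx in rng.permutation(len(matches)):
            a, b, winner = matches[idx]
            r_a = ratings.setdefault(a, initial_rating)
            r_b = ratings.setdefault(b, initial_rating)
            # E_A per the formula above; plain math.pow, per the notes.
            e_a = 1.0 / (1.0 + math.pow(10.0, (r_b - r_a) / 400.0))
            s_a = 1.0 if winner == a else 0.0
            ratings[a] = r_a + k * (s_a - e_a)
            ratings[b] = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
        return ratings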

Functions

compute_elo_permutation(matches, *[, k, ...])

Compute permutation-averaged Elo ratings.

k_factor_sweep(matches, *[, k_values, ...])

Run compute_elo_permutation() across a list of K-factors.

rank_entrants(results)

Return [(entrant_id, mean_rating), ...] sorted high-to-low.

Classes

EloResult(entrant_id, mean, sem, ci95_low, ...)

Per-entrant rating statistics across permutations.

class olmo_tap.final_evals.elo.elo_engine.EloResult(entrant_id: str, mean: float, sem: float, ci95_low: float, ci95_high: float, per_perm_ratings: ndarray)[source]

Bases: object

Per-entrant rating statistics across permutations.

entrant_id

The entrant the statistics are for.

Type: str

mean

Mean Elo rating across the permutations.

Type: float

sem

Standard error of the mean (std(ddof=1) / sqrt(n_perms)).

Type: float

ci95_low

mean - 1.96 * sem.

Type: float

ci95_high

mean + 1.96 * sem.

Type: float

per_perm_ratings

1-D array of per-permutation final ratings; the full trace is returned so that downstream reports can plot distributions, not just summary statistics.

Type: numpy.ndarray

olmo_tap.final_evals.elo.elo_engine.compute_elo_permutation(matches: Iterable[tuple[str, str, str | None]], *, k: float = 16.0, initial_rating: float = 1400.0, n_perms: int = 500, seed: int = 0) → dict[str, EloResult][source]

Compute permutation-averaged Elo ratings.

The match list is shuffled n_perms times with numpy.random.default_rng(seed); for each permutation the Elo update is run sequentially and the final rating per entrant is recorded. The returned dict reports mean ± SEM and a 95% CI per entrant alongside the full per-permutation trace.

Parameters:
  • matches – Iterable of (entrant_a, entrant_b, winner) triples. winner may be None or "TIE" to mark a tie; ties are dropped before the permutation loop.

  • k – Elo K-factor.

  • initial_rating – Starting rating for every entrant on every permutation. Boubdir et al. (2023) fix this at 1400.

  • n_perms – Number of independent permutations to run. The paper recommends >= 100; the headline run uses 500.

  • seed – Seed for the NumPy Generator driving the shuffles. Logged on the result so runs are reproducible.

Returns:

Mapping entrant_id -> EloResult for every entrant referenced by matches.

Raises:

ValueError – If n_perms < 1, if matches is empty, or if any match has a winner that is neither one of the two entrants nor a recognised tie marker.
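Usage sketch (the entrant ids are made up for illustration; the printed values depend on the match data):

    from olmo_tap.final_evals.elo.elo_engine import compute_elo_permutation

    matches = [
        ("model_a", "model_b", "model_a"),
        ("model_a", "model_c", "model_c"),
        ("model_b", "model_c", None),  # tie: dropped before the permutation loop
        ("model_a", "model_b", "model_a"),
    ]

    results = compute_elo_permutation(matches, k=16.0, n_perms=500, seed=0)
    for entrant_id, res in results.items():
        print(f"{entrant_id}: {res.mean:.1f} ± {res.sem:.1f} "
              f"(95% CI [{res.ci95_low:.1f}, {res.ci95_high:.1f}])")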

olmo_tap.final_evals.elo.elo_engine.k_factor_sweep(matches: Iterable[tuple[str, str, str | None]], *, k_values: Sequence[int | float] = (1, 4, 8, 16, 32), initial_rating: float = 1400.0, n_perms: int = 500, seed: int = 0) → dict[float, dict[str, EloResult]][source]

Run compute_elo_permutation() across a list of K-factors.

Returns a heatmap-ready {k: {entrant_id: EloResult}} structure. The same shuffle seed is used for every K so the difference across the sweep reflects the K-factor’s effect rather than shuffle variance.

Parameters:
  • matches – Same contract as compute_elo_permutation().

  • k_values – K-factors to sweep. The default (1, 4, 8, 16, 32) covers the range where Boubdir et al. (2023) observed ranking changes across their benchmark suites.

  • initial_rating – Starting rating per entrant per permutation.

  • n_perms – Permutations per K-factor.

  • seed – Seed for every K’s permutation loop.

Returns:

Mapping k -> {entrant_id: EloResult}. K-factors are stored as float keys for stable hashing across int / float inputs.
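Continuing the usage sketch from compute_elo_permutation() above:

    from olmo_tap.final_evals.elo.elo_engine import k_factor_sweep

    sweep = k_factor_sweep(matches, k_values=(1, 4, 8, 16, 32), n_perms=500, seed=0)
    # Heatmap-ready: one row per K-factor, one column per entrant.
    for k, per_entrant in sweep.items():
        print(k, {eid: round(res.mean) for eid, res in per_entrant.items()})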

olmo_tap.final_evals.elo.elo_engine.rank_entrants(results: dict[str, EloResult]) → list[tuple[str, float]][source]

Return [(entrant_id, mean_rating), ...] sorted high-to-low.
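The sort key is the mean rating only, so the result is equivalent to this sketch (entrants with identical means, if any, may be ordered differently):

    ranking = sorted(
        ((eid, res.mean) for eid, res in results.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )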