olmo_tap.final_evals.elo.match_builder

Pairwise match list construction for the configuration-level Elo run.

Given the prompt bank, the per-entrant response cache, and the entrant roster, this module produces the PairToJudge payloads consumed by the LLM-judge pipeline. One PairToJudge is produced per (prompt, unordered entrant pair, dimension) triple; both position orderings (A-vs-B and B-vs-A) are generated inside judge_pairs() itself, so the builder is concerned with unique pairs only.

Two filtering rules are baked in (both appear in the sketch after this list):

  • Factuality requires a gold answer, so curated trustworthiness prompts (source == "curated") are dropped from that dimension only. Calibration and clinical_utility evaluate framing and run on the full bank.

  • Prompts whose responses are missing for any participating entrant are dropped with a warning, so a partially populated response cache still produces a coherent (smaller) match list rather than tripping a KeyError deep inside the judge call.
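
A minimal sketch of the core loop under these rules; the _Prompt and _PairToJudge stand-ins (and their field names) are illustrative assumptions, not the module's real types:

    import itertools
    import logging
    from dataclasses import dataclass

    logger = logging.getLogger(__name__)

    @dataclass
    class _Prompt:               # stand-in for the real Prompt
        prompt_id: str
        source: str              # "medmcqa_open" | "medqa" | "curated"

    @dataclass
    class _PairToJudge:          # stand-in; field names are assumed
        prompt_id: str
        entrant_a: str
        entrant_b: str
        dimension: str

    def _sketch_build(bank, response_cache, entrants, dimensions):
        pairs = list(itertools.combinations(entrants, 2))  # unique unordered pairs, roster order
        out = {dim: [] for dim in dimensions}
        for dim in dimensions:
            for prompt in bank:
                # Rule 1: factuality needs a gold answer, so curated prompts are skipped there.
                if dim == "factuality" and prompt.source == "curated":
                    continue
                # Rule 2: keep a prompt only if every entrant has a cached response.
                missing = [e for e in entrants if (e, prompt.prompt_id) not in response_cache]
                if missing:
                    logger.warning("dropping %s: no response for %s", prompt.prompt_id, missing)
                    continue
                for a, b in pairs:  # A-vs-B only; judge_pairs() adds the B-vs-A ordering
                    out[dim].append(_PairToJudge(prompt.prompt_id, a, b, dim))
        return out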

A second helper, select_pilot_subset(), draws a stratified 50-prompt subset for the pilot run.

Functions

build_match_list(bank, response_cache, entrants)

Construct the per-dimension pair lists fed to the LLM judge.

select_pilot_subset(bank[, n, seed])

Draw a reproducible stratified subset of the prompt bank.

olmo_tap.final_evals.elo.match_builder.build_match_list(bank: list[Prompt], response_cache: dict[tuple[str, str], GeneratedResponse], entrants: list[str], dimensions: Iterable[Literal['factuality', 'calibration', 'clinical_utility']] | None = None) → dict[Literal['factuality', 'calibration', 'clinical_utility'], list[PairToJudge]]

Construct the per-dimension pair lists fed to the LLM judge.

For each unordered pair of distinct entrants, each prompt in bank (after source filtering), and each dimension in dimensions, one PairToJudge is produced. The canonical (entrant_a, entrant_b) ordering follows the input order of entrants; the matching B-vs-A query is fired by judge_pairs() internally and is not represented here.
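
As a toy illustration of the canonical ordering (the entrant ids are hypothetical), itertools.combinations() walks the roster in input order:

    import itertools

    entrants = ["olmo-7b", "olmo-13b", "baseline"]  # hypothetical ids
    print(list(itertools.combinations(entrants, 2)))
    # [('olmo-7b', 'olmo-13b'), ('olmo-7b', 'baseline'), ('olmo-13b', 'baseline')]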

Curated prompts (source == "curated") are excluded from factuality because they have no gold answer; they are kept for calibration and clinical_utility.

Parameters:
  • bank – Prompt bank loaded via generate.load_prompt_bank().

  • response_cache – Mapping of (entrant_id, prompt_id) → response. Prompts missing any entrant’s response are skipped with a warning.

  • entrants – Ordered list of entrant ids. Determines pair generation order via itertools.combinations(). Must contain at least two unique ids.

  • dimensions – Dimensions to build pair lists for. Defaults to all three (factuality, calibration, clinical_utility).

Returns:

{dimension: [PairToJudge, ...]}. Empty list for any dimension that has no surviving prompts.
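
A hedged usage sketch: the entrant ids are hypothetical and the response cache is assumed to come from the earlier generation step; only load_prompt_bank() is documented above.

    from olmo_tap.final_evals.elo import generate, match_builder

    bank = generate.load_prompt_bank()              # full 143-prompt bank
    entrants = ["olmo-7b", "olmo-13b", "baseline"]  # hypothetical ids -> 3 unordered pairs
    response_cache = ...                            # (entrant_id, prompt_id) -> GeneratedResponse

    matches = match_builder.build_match_list(bank, response_cache, entrants)
    # With a fully populated cache:
    #   factuality:       (143 - 50 curated) prompts * 3 pairs = 279 PairToJudge
    #   calibration:      143 * 3 = 429
    #   clinical_utility: 143 * 3 = 429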

olmo_tap.final_evals.elo.match_builder.select_pilot_subset(bank: list[Prompt], n: int = 50, seed: int = 20260427) → list[Prompt]

Draw a reproducible stratified subset of the prompt bank.

The default 50-prompt sample preserves the source mix of the full 143-prompt bank (24 medmcqa_open + 69 medqa + 50 curated) at PILOT_STRATA = {medmcqa_open: 8, medqa: 24, curated: 18}. The output is deterministic given seed (default 20260427).

Returned prompts follow the original bank ordering so downstream artifacts (matches.jsonl, pilot_summary.md) read consistently.

Parameters:
  • bank – Full prompt bank.

  • n – Total subset size. Currently must equal DEFAULT_PILOT_SIZE; supporting a different n would require deliberately adjusting the strata in code.

  • seed – NumPy Generator seed for the per-stratum sample.

Raises:

ValueError – If n is not the supported pilot size, or if any stratum lacks the required number of prompts.
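
A minimal sketch of one plausible implementation, assuming Prompt exposes a source attribute and the per-stratum draw uses NumPy’s Generator; the real module may differ in detail:

    import numpy as np

    PILOT_STRATA = {"medmcqa_open": 8, "medqa": 24, "curated": 18}

    def _sketch_pilot(bank, seed=20260427):
        rng = np.random.default_rng(seed)
        keep = set()
        for source, k in PILOT_STRATA.items():
            idx = [i for i, p in enumerate(bank) if p.source == source]
            if len(idx) < k:
                raise ValueError(f"stratum {source!r} has {len(idx)} prompts, needs {k}")
            keep.update(int(i) for i in rng.choice(idx, size=k, replace=False))
        return [p for i, p in enumerate(bank) if i in keep]  # original bank order preserved

Called against the full bank with the defaults, the draw reproduces the documented source mix:

    from collections import Counter
    from olmo_tap.final_evals.elo import generate, match_builder

    bank = generate.load_prompt_bank()
    pilot = match_builder.select_pilot_subset(bank)  # n=50, seed=20260427
    print(Counter(p.source for p in pilot))
    # Counter({'medqa': 24, 'curated': 18, 'medmcqa_open': 8})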