olmo_tap.final_evals.elo.match_builder
Pairwise match list construction for the configuration-level Elo run.
Given the prompt bank, the per-entrant response cache, and the entrant
roster, this module produces the PairToJudge payloads consumed by
the LLM-judge pipeline. One PairToJudge is produced per
(prompt, unordered entrant pair, dimension) triple; both position
orderings (A-vs-B and B-vs-A) are generated inside judge_pairs()
itself, so the builder is concerned with unique pairs only.
Two filtering rules are baked in:
- Factuality requires a gold answer, so curated trustworthiness prompts (source == "curated") are dropped from that dimension only. Calibration and clinical_utility evaluate framing and run on the full bank.
- Prompts whose responses are missing for any participating entrant are dropped with a warning, so a partially populated response cache still produces a coherent (smaller) match list rather than tripping a KeyError deep inside the judge call.
A second helper, select_pilot_subset(), draws a stratified
50-prompt subset for the pilot run.
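As a rough sizing aid, the one-payload-per-triple rule can be turned into a back-of-the-envelope count. The sketch below is illustrative only: `expected_pair_counts` is a hypothetical helper, the roster size of 4 is made up, and the bank mix is the one quoted in the select_pilot_subset() documentation. It gives an upper bound that assumes no prompt is dropped for missing responses.

```python
from math import comb

# Source mix of the full bank, as quoted in the select_pilot_subset() docs.
BANK_MIX = {"medmcqa_open": 24, "medqa": 69, "curated": 50}
DIMENSIONS = ("factuality", "calibration", "clinical_utility")


def expected_pair_counts(n_entrants: int) -> dict[str, int]:
    """Upper bound on payloads per dimension (hypothetical helper)."""
    n_pairs = comb(n_entrants, 2)  # unordered entrant pairs
    total = sum(BANK_MIX.values())
    counts = {}
    for dim in DIMENSIONS:
        # Curated prompts have no gold answer, so factuality skips them.
        n_prompts = total - BANK_MIX["curated"] if dim == "factuality" else total
        counts[dim] = n_prompts * n_pairs
    return counts


counts = expected_pair_counts(4)  # 4 entrants is an assumed roster size
print(counts)
```

Because both position orderings are generated inside judge_pairs(), the number of judge queries would be twice these figures under the same assumptions.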
Functions
| build_match_list() | Construct the per-dimension pair lists fed to the LLM judge. |
| select_pilot_subset() | Draw a reproducible stratified subset of the prompt bank. |
- olmo_tap.final_evals.elo.match_builder.build_match_list(bank: list[Prompt], response_cache: dict[tuple[str, str], GeneratedResponse], entrants: list[str], dimensions: Iterable[Literal['factuality', 'calibration', 'clinical_utility']] | None = None) → dict[Literal['factuality', 'calibration', 'clinical_utility'], list[PairToJudge]]
Construct the per-dimension pair lists fed to the LLM judge.
For each unordered pair of distinct entrants, each prompt in bank (after source filtering), and each dimension in dimensions, one PairToJudge is produced. The canonical (entrant_a, entrant_b) ordering follows the input order of entrants; the matching B-vs-A query is fired by judge_pairs() internally and is not represented here.
Curated prompts (source == "curated") are excluded from factuality because they have no gold answer; they are kept for calibration and clinical_utility.
- Parameters:
bank – Prompt bank loaded via generate.load_prompt_bank().
response_cache – Mapping (entrant_id, prompt_id) → response. Prompts missing any entrant’s response are skipped with a warning.
entrants – Ordered list of entrant ids. Determines pair generation order via itertools.combinations(). Must contain at least two unique ids.
dimensions – Dimensions to build pair lists for. Defaults to all three (factuality, calibration, clinical_utility).
- Returns:
{dimension: [PairToJudge, ...]}. Empty list for any dimension that has no surviving prompts.
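The behaviour described above can be approximated with a short, self-contained sketch. It is not the module's implementation: `Prompt` and `Pair` are simplified stand-ins for the real Prompt and PairToJudge types (whose fields are not documented here), `sketch_match_list` is a hypothetical name, and the ordering of items inside each list is an assumption.

```python
import itertools
import warnings
from dataclasses import dataclass


@dataclass(frozen=True)
class Prompt:  # stand-in; the real Prompt class has its own fields
    prompt_id: str
    source: str


@dataclass(frozen=True)
class Pair:  # stand-in for PairToJudge
    prompt_id: str
    entrant_a: str
    entrant_b: str
    dimension: str


def sketch_match_list(bank, response_cache, entrants,
                      dimensions=("factuality", "calibration", "clinical_utility")):
    # Drop prompts that lack a response from any participating entrant.
    usable = []
    for p in bank:
        missing = [e for e in entrants if (e, p.prompt_id) not in response_cache]
        if missing:
            warnings.warn(f"skipping {p.prompt_id}: no response from {missing}")
            continue
        usable.append(p)

    out = {dim: [] for dim in dimensions}
    for dim in dimensions:
        for p in usable:
            # Curated prompts have no gold answer, so factuality skips them.
            if dim == "factuality" and p.source == "curated":
                continue
            # combinations() yields each unordered pair once, in input order.
            for a, b in itertools.combinations(entrants, 2):
                out[dim].append(Pair(p.prompt_id, a, b, dim))
    return out


# Toy data: entrant ids and prompt ids are invented for illustration.
bank = [Prompt("q1", "medqa"), Prompt("q2", "curated"), Prompt("q3", "medqa")]
entrants = ["base", "tuned", "ensemble"]
cache = {(e, pid): "..." for e in entrants for pid in ("q1", "q2")}
cache[("base", "q3")] = "..."  # q3 is missing two entrants, so it is skipped
matches = sketch_match_list(bank, cache, entrants)
```

In this toy run q3 is dropped with a warning and q2 is excluded from factuality only, so factuality ends up with fewer pairs than the other two dimensions.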
- olmo_tap.final_evals.elo.match_builder.select_pilot_subset(bank: list[Prompt], n: int = 50, seed: int = 20260427) → list[Prompt]
Draw a reproducible stratified subset of the prompt bank.
The default 50-prompt sample preserves the source mix of the full 143-prompt bank (24 medmcqa_open + 69 medqa + 50 curated) at PILOT_STRATA = {medmcqa_open: 8, medqa: 24, curated: 18}. The output is deterministic given seed (default 20260427).
Returned prompts follow the original bank ordering so downstream artifacts (matches.jsonl, pilot_summary.md) read consistently.
- Parameters:
bank – Full prompt bank.
n – Total subset size. Currently must equal DEFAULT_PILOT_SIZE; adjusting the strata to a different n is a deliberate code change.
seed – NumPy Generator seed for the per-stratum sample.
- Raises:
ValueError – If n is not the supported pilot size, or if any stratum lacks the required number of prompts.
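The stratified draw can be illustrated with a minimal sketch. Two deliberate simplifications apply: prompts are modelled as `(prompt_id, source)` tuples rather than Prompt objects, and the standard-library `random.Random` is swapped in for the NumPy Generator so the example has no third-party dependency. The prompts it selects will therefore differ from those chosen by select_pilot_subset(); only the shape of the procedure is the same. `sketch_pilot_subset` is a hypothetical name.

```python
import random
from collections import Counter, defaultdict

PILOT_STRATA = {"medmcqa_open": 8, "medqa": 24, "curated": 18}


def sketch_pilot_subset(bank, strata=PILOT_STRATA, seed=20260427):
    """bank is a list of (prompt_id, source) tuples (simplified stand-in)."""
    rng = random.Random(seed)  # stdlib stand-in for the NumPy Generator

    # Group bank positions by source so each stratum is sampled separately.
    by_source = defaultdict(list)
    for idx, (_, source) in enumerate(bank):
        by_source[source].append(idx)

    chosen = set()
    for source, k in strata.items():
        pool = by_source[source]
        if len(pool) < k:
            raise ValueError(f"stratum {source!r} has {len(pool)} prompts, need {k}")
        chosen.update(rng.sample(pool, k))  # sample without replacement

    # Sorting the indices restores the original bank ordering.
    return [bank[i] for i in sorted(chosen)]


# Synthetic bank with the documented source mix (24 + 69 + 50 = 143).
sizes = (("medmcqa_open", 24), ("medqa", 69), ("curated", 50))
bank = [(f"{src}-{i}", src) for src, size in sizes for i in range(size)]
subset = sketch_pilot_subset(bank)
print(Counter(source for _, source in subset))
```

Sampling indices rather than prompts keeps the final ordering step trivial, and a fixed seed makes repeated calls return the same subset.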