olmo_tap.final_evals.elo.judge

LLM-judge pipeline using the Anthropic Batch API with prompt caching.

Given a list of (prompt, response_a, response_b) triples and a rubric, this module returns swap-consistent verdicts deduplicated against an on-disk judgment cache.

Pipeline

  1. Build judge queries — every triple is judged twice, once in each position order, to mitigate position bias.

  2. Filter against the on-disk cache. Cache keys are content hashes over the rubric file contents, judge model id, prompt id, the two entrant ids, the position order, and both response texts. Any change to the rubric file produces a new cache key, so rubric edits invalidate stale entries automatically.

  3. Submit the remaining queries to client.messages.batches with the system message + rubric text marked as a 1-hour cache breakpoint, so every request in the batch reads the cached prefix.

  4. Poll the batch with exponential backoff until it ends, then stream the results into the cache JSONL.

  5. Reconcile the two position orders into a single PairOutcome: consistent verdicts win; inconsistent ones become ties (and are dropped from the Elo match list upstream).
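Step 1 above can be sketched as follows. The class and field names here are illustrative stand-ins for the module's private _JudgeQuery, not its actual definition:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class JudgeQuery:
    """Illustrative stand-in for the module's private _JudgeQuery."""
    prompt_id: str
    entrant_a: str       # entrant whose response sits in position A
    entrant_b: str
    response_a: str
    response_b: str
    position_swap: bool  # True for the reversed ordering

def build_judge_queries(pair) -> list[JudgeQuery]:
    """Emit both position orders for one PairToJudge-like object."""
    forward = JudgeQuery(pair.prompt_id, pair.entrant_a, pair.entrant_b,
                         pair.response_a, pair.response_b, position_swap=False)
    swapped = JudgeQuery(pair.prompt_id, pair.entrant_b, pair.entrant_a,
                         pair.response_b, pair.response_a, position_swap=True)
    return [forward, swapped]
```

Judging each pair in both orders doubles the query count but lets step 5 detect position bias: a judge that flips its verdict when the responses swap seats has not expressed a stable preference.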

Source-handling asymmetry

The factuality rubric requires a gold answer; curated trustworthiness prompts (source == "curated") carry no gold_answer, so they are filtered out of the factuality match list. Curated prompts are kept for calibration and clinical_utility.

For the calibration rubric, curated prompts include their expected_behavior tag in the user message so the judge can score appropriate hedging / abstention; non-curated prompts omit that tag.

Functions

append_cache_record(cache_dir, dimension, record)

Append a single record to the dimension's cache JSONL.

build_system_message(rubric)

Compose the cacheable system message: a fixed header plus rubric text.

build_user_message(query)

Compose the per-pair user message.

cache_path_for(cache_dir, dimension)

Return the on-disk JSONL cache path for one dimension.

derive_cache_key(*, rubric_text, ...)

Return a 12-hex-char SHA256 prefix over the inputs.

filter_pairs_for_dimension(pairs, dimension)

Drop pairs that should not be judged on the given dimension.

judge_pairs(pairs, rubric, config, *[, client])

Judge pairs on rubric and return reconciled judgments.

load_cache(cache_dir, dimension)

Load the JSONL cache for one dimension into a cache_key -> record dict.

parse_verdict(reply_text)

Pull the final VERDICT: X line out of the model's reply.

reconcile_swap(forward, swapped, dimension)

Combine the two position-ordered raw judgments into one Judgment.

Classes

CacheStats([cache_creation_input_tokens, ...])

Aggregated token usage from a judge_pairs run.

JudgeConfig(judge_model, cache_dir[, ...])

Knobs for one judge_pairs call.

JudgeResult(judgments, cache_stats, rubric)

Output of judge_pairs: reconciled judgments and run stats.

Judgment(prompt_id, dimension, entrant_a, ...)

Reconciled verdict for one (pair, prompt, dimension) triple.

PairToJudge(prompt_id, source, prompt_text, ...)

One pair of responses to be judged on a given prompt and rubric.

Rubric(dimension, version, text, path)

A loaded rubric: dimension, version header, and full file contents.

class olmo_tap.final_evals.elo.judge.CacheStats(cache_creation_input_tokens: int = 0, cache_read_input_tokens: int = 0, input_tokens: int = 0, output_tokens: int = 0, fresh_calls: int = 0, cache_hits: int = 0)[source]

Bases: object

Aggregated token usage from a judge_pairs run.

cache_creation_input_tokens: int = 0
cache_hits: int = 0
cache_read_input_tokens: int = 0
fresh_calls: int = 0
input_tokens: int = 0
output_tokens: int = 0
class olmo_tap.final_evals.elo.judge.JudgeConfig(judge_model: str, cache_dir: Path, max_tokens: int = 1024, poll_initial_seconds: float = 30.0, poll_max_seconds: float = 300.0, submit_max_retries: int = 3, api_key: str | None = None)[source]

Bases: object

Knobs for one judge_pairs call.

api_key: str | None = None
cache_dir: Path
judge_model: str
max_tokens: int = 1024
poll_initial_seconds: float = 30.0
poll_max_seconds: float = 300.0
submit_max_retries: int = 3
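
The poll_initial_seconds / poll_max_seconds knobs describe a doubling backoff schedule. A minimal sketch of what that schedule yields (the generator name is mine, not the module's):

```python
def poll_delays(initial: float, maximum: float):
    """Yield the exponential-backoff sleep schedule used while polling
    a batch: double each time, capped at `maximum`."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay = min(delay * 2, maximum)
```

With the defaults (30 s initial, 300 s cap) the poller sleeps 30, 60, 120, 240 seconds and then settles at 300 seconds per poll until the batch ends.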
class olmo_tap.final_evals.elo.judge.JudgeResult(judgments: list[Judgment], cache_stats: CacheStats, rubric: Rubric)[source]

Bases: object

Output of judge_pairs: reconciled judgments and run stats.

cache_stats: CacheStats
judgments: list[Judgment]
rubric: Rubric
class olmo_tap.final_evals.elo.judge.Judgment(prompt_id: str, dimension: Literal['factuality', 'calibration', 'clinical_utility'], entrant_a: str, entrant_b: str, winner: str | None, inconsistent: bool, raw: tuple[_RawJudgement, _RawJudgement])[source]

Bases: object

Reconciled verdict for one (pair, prompt, dimension) triple.

winner is the winning entrant id (entrant_a or entrant_b) when the two position orders agree on a non-tie verdict, and None for a tie or an inconsistent swap-pair (Elo treats both as ties). inconsistent is True only when the two position orders disagreed.

dimension: Literal['factuality', 'calibration', 'clinical_utility']
entrant_a: str
entrant_b: str
inconsistent: bool
prompt_id: str
raw: tuple[_RawJudgement, _RawJudgement]
winner: str | None
class olmo_tap.final_evals.elo.judge.PairToJudge(prompt_id: str, source: str, prompt_text: str, entrant_a: str, entrant_b: str, response_a: str, response_b: str, gold_answer: str | None = None, expected_behavior: str | None = None)[source]

Bases: object

One pair of responses to be judged on a given prompt and rubric.

entrant_a: str
entrant_b: str
expected_behavior: str | None = None
gold_answer: str | None = None
prompt_id: str
prompt_text: str
response_a: str
response_b: str
source: str
class olmo_tap.final_evals.elo.judge.Rubric(dimension: Literal['factuality', 'calibration', 'clinical_utility'], version: str, text: str, path: Path)[source]

Bases: object

A loaded rubric: dimension, version header, and full file contents.

dimension: Literal['factuality', 'calibration', 'clinical_utility']
classmethod load(dimension: Literal['factuality', 'calibration', 'clinical_utility'], path: Path) → Rubric[source]

Load a rubric file and parse its # version: header.

path: Path
text: str
version: str
olmo_tap.final_evals.elo.judge.build_system_message(rubric: Rubric) → str[source]

Compose the cacheable system message: a fixed header plus rubric text.

The full rubric file (including its version header) is embedded so that any byte-level edit to the rubric invalidates the prompt cache, matching the cache-key behaviour. The header is intentionally substantive so the cached prefix exceeds the API’s minimum cacheable token count, making prompt caching economically meaningful within a batch.
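
Once composed, the system text is submitted as a content block carrying a 1-hour cache breakpoint. A sketch of that block shape, assuming the Anthropic cache_control format with the extended "1h" TTL (the header wording is illustrative):

```python
def system_blocks(header: str, rubric_text: str) -> list[dict]:
    """One system text block ending in a 1-hour cache breakpoint."""
    return [{
        "type": "text",
        "text": f"{header}\n\n{rubric_text}",
        # Every request in the batch that shares this exact prefix
        # byte-for-byte reads the cached copy instead of paying full price.
        "cache_control": {"type": "ephemeral", "ttl": "1h"},
    }]
```

Because the rubric file contents are embedded verbatim, any rubric edit changes the prefix bytes and therefore both the prompt cache and the on-disk cache key in lockstep.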

olmo_tap.final_evals.elo.judge.build_user_message(query: _JudgeQuery) → str[source]

Compose the per-pair user message. Not cached — varies every request.

olmo_tap.final_evals.elo.judge.cache_path_for(cache_dir: Path, dimension: Literal['factuality', 'calibration', 'clinical_utility']) → Path[source]

Return the on-disk JSONL cache path for one dimension.

olmo_tap.final_evals.elo.judge.derive_cache_key(*, rubric_text: str, judge_model: str, prompt_id: str, entrant_a: str, entrant_b: str, position_swap: bool, response_a: str, response_b: str) → str[source]

Return a 12-hex-char SHA256 prefix over the inputs.

The hash is taken over a JSON-encoded dict with sorted keys so the output is deterministic. Callers should pass the same values they would send to the judge — in particular, response_a is whatever text appears in position A before swapping. The position order is captured separately via position_swap.
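
A plausible implementation consistent with this docstring (sorted-key JSON, SHA-256, 12-hex-char prefix) — a sketch, not necessarily byte-identical to the module's:

```python
import hashlib
import json

def derive_cache_key(*, rubric_text: str, judge_model: str, prompt_id: str,
                     entrant_a: str, entrant_b: str, position_swap: bool,
                     response_a: str, response_b: str) -> str:
    """Deterministic 12-hex-char key over everything that affects a verdict."""
    payload = json.dumps({
        "rubric_text": rubric_text, "judge_model": judge_model,
        "prompt_id": prompt_id, "entrant_a": entrant_a,
        "entrant_b": entrant_b, "position_swap": position_swap,
        "response_a": response_a, "response_b": response_b,
    }, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

Because rubric_text is hashed in full, editing the rubric file yields fresh keys for every pair, which is exactly how stale cache entries are invalidated.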

olmo_tap.final_evals.elo.judge.filter_pairs_for_dimension(pairs: Iterable[PairToJudge], dimension: Literal['factuality', 'calibration', 'clinical_utility']) → list[PairToJudge][source]

Drop pairs that should not be judged on the given dimension.

The factuality rubric needs a gold answer, which curated trustworthiness prompts do not have. Calibration and clinical_utility evaluate framing and are run on the full bank.
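
A minimal sketch of that rule, assuming curated prompts are exactly the ones lacking a gold answer (the module may also check source == "curated" directly):

```python
def filter_pairs_for_dimension(pairs, dimension: str) -> list:
    """Drop pairs the given dimension cannot judge."""
    if dimension != "factuality":
        return list(pairs)  # calibration / clinical_utility: full bank
    # Factuality needs a gold answer; curated prompts carry none.
    return [p for p in pairs if p.gold_answer is not None]
```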

olmo_tap.final_evals.elo.judge.judge_pairs(pairs: list[PairToJudge], rubric: Rubric, config: JudgeConfig, *, client: Any = None) → JudgeResult[source]

Judge pairs on rubric and return reconciled judgments.

Loads the on-disk cache for the rubric’s dimension, filters out any pair-orderings already judged, submits the rest as one Anthropic batch with the rubric prefix marked as a 1-hour cache breakpoint, polls until the batch ends, writes new entries to the cache, and reconciles each pair’s two position orders into one Judgment.

client is optional — passing one in is convenient for tests.

olmo_tap.final_evals.elo.judge.load_cache(cache_dir: Path, dimension: Literal['factuality', 'calibration', 'clinical_utility']) → dict[str, dict[str, Any]][source]

Load the JSONL cache for one dimension into a cache_key -> record dict.

If the same cache_key appears twice (legacy / concurrent-write artefact), the later entry wins.
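
The later-entry-wins load reduces to a dict overwrite while streaming the file. A sketch, assuming each JSONL record stores its key under a "cache_key" field:

```python
import json
from pathlib import Path

def load_cache_jsonl(path: Path) -> dict[str, dict]:
    """Load one dimension's JSONL cache; duplicate keys resolve to the
    last occurrence, matching the dedup rule above."""
    records: dict[str, dict] = {}
    if not path.exists():
        return records
    with path.open() as fh:
        for line in fh:
            if line.strip():
                rec = json.loads(line)
                records[rec["cache_key"]] = rec  # overwrite: later wins
    return records
```

Append-only JSONL plus last-write-wins is what makes append_cache_record safe: a rerun can blindly append and the next load still sees one record per key.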

olmo_tap.final_evals.elo.judge.parse_verdict(reply_text: str) → Literal['A', 'B', 'TIE'][source]

Pull the final VERDICT: X line out of the model’s reply.

Falls back to TIE if the model emits something we cannot parse — callers that want stricter behaviour should inspect the reasoning text directly.
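
A sketch of that parse, scanning from the end of the reply so the final VERDICT: line wins (the regex is mine; the module's pattern may differ):

```python
import re

_VERDICT_RE = re.compile(r"^\s*VERDICT:\s*(A|B|TIE)\b", re.IGNORECASE)

def parse_verdict(reply_text: str) -> str:
    """Return 'A', 'B', or 'TIE'; unparseable replies fall back to TIE."""
    for line in reversed(reply_text.splitlines()):
        m = _VERDICT_RE.match(line)
        if m:
            return m.group(1).upper()
    return "TIE"
```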

olmo_tap.final_evals.elo.judge.reconcile_swap(forward: _RawJudgement, swapped: _RawJudgement, dimension: Literal['factuality', 'calibration', 'clinical_utility']) → Judgment[source]

Combine the two position-ordered raw judgments into one Judgment.
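
The core of the reconciliation, as a sketch over bare verdict strings rather than _RawJudgement objects (function and argument names are mine): each verdict is mapped through its position order to an entrant id, then the two are compared.

```python
def reconcile_verdicts(forward_verdict: str, swapped_verdict: str,
                       entrant_a: str, entrant_b: str):
    """Return (winner, inconsistent) for one swap-pair of 'A'/'B'/'TIE' verdicts."""
    def to_entrant(verdict: str, swapped: bool):
        if verdict == "TIE":
            return None
        # In the swapped order, position A holds entrant_b's response.
        first, second = (entrant_b, entrant_a) if swapped else (entrant_a, entrant_b)
        return first if verdict == "A" else second

    fwd = to_entrant(forward_verdict, swapped=False)
    swp = to_entrant(swapped_verdict, swapped=True)
    if fwd == swp:                 # same entrant wins, or both are ties
        return fwd, False
    return None, True              # position orders disagree: scored as a tie
```

Note the flip: a consistent judge answers "A" in the forward order and "B" in the swapped order; a judge that answers "A" both times is exhibiting position bias and gets marked inconsistent.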