olmo_tap.final_evals.elo.judge¶
LLM-judge pipeline using the Anthropic Batch API with prompt caching.
Given a list of (prompt, response_a, response_b) triples and a rubric,
this module returns swap-consistent verdicts deduplicated against an
on-disk judgment cache.
Pipeline¶
1. Build judge queries — every triple is judged twice, once in each position order, to mitigate position bias.
2. Filter against the on-disk cache. Cache keys are content hashes over the rubric file contents, judge model id, prompt id, the two entrant ids, the position order, and both response texts. Any change to the rubric file produces a new cache key, so rubric edits invalidate stale entries automatically.
3. Submit the remaining queries to client.messages.batches with the system message + rubric text marked as a 1-hour cache breakpoint, so every request in the batch reads the cached prefix.
4. Poll the batch with exponential backoff until it ends, then stream the results into the cache JSONL.
5. Reconcile the two position orders into a single PairOutcome: consistent verdicts win; inconsistent ones become ties (and are dropped from the Elo match list upstream).
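The swap-consistency rule in the final step can be sketched as follows. This is an illustrative reconstruction, not the module's implementation: the function name and verdict encoding are assumptions, taking each order's raw verdict as an entrant id already mapped back from its positional label, or None for a tie.

```python
# Sketch of swap-consistent reconciliation (hypothetical helper).
# verdict_orig / verdict_swapped: winning entrant id, or None for a tie,
# with positional labels (A/B) already mapped back to entrant ids.
def reconcile(verdict_orig, verdict_swapped):
    """Return (winner, inconsistent) for one judged pair."""
    if verdict_orig == verdict_swapped:
        # Both position orders agree (including agreeing on a tie).
        return verdict_orig, False
    # The two orders disagree: no winner; flagged inconsistent,
    # which upstream Elo scoring treats as a tie.
    return None, True
```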
Source-handling asymmetry¶
The factuality rubric requires a gold answer; curated trustworthiness
prompts (source == "curated") carry no gold_answer, so they are
filtered out of the factuality match list. They are kept for
calibration and clinical_utility.
For the calibration rubric, curated prompts include their
expected_behavior tag in the user message so the judge can score
appropriate hedging / abstention; non-curated prompts omit that tag.
Functions¶
- Append a single record to the dimension's cache JSONL.
- build_system_message: Compose the cacheable system message: a fixed header plus rubric text.
- build_user_message: Compose the per-pair user message.
- cache_path_for
- derive_cache_key: Return a 12-hex-char SHA256 prefix over the inputs.
- filter_pairs_for_dimension: Drop pairs that should not be judged on the given dimension.
- judge_pairs: Judge pairs on rubric and return reconciled judgments.
- load_cache: Load the JSONL cache for one dimension into a cache_key -> record dict.
- Pull the final …
- Combine the two position-ordered raw judgements into one Judgment.
Classes¶
- CacheStats: Aggregated token usage from a judge_pairs run.
- JudgeConfig: Knobs for one judge_pairs call.
- JudgeResult: Output of judge_pairs: reconciled judgments and run stats.
- Judgment: Reconciled verdict for one (pair, prompt, dimension) triple.
- PairToJudge: One pair of responses to be judged on a given prompt and rubric.
- Rubric: A loaded rubric: dimension, version header, and full file contents.
- class olmo_tap.final_evals.elo.judge.CacheStats(cache_creation_input_tokens: int = 0, cache_read_input_tokens: int = 0, input_tokens: int = 0, output_tokens: int = 0, fresh_calls: int = 0, cache_hits: int = 0)[source]¶
Bases: object
Aggregated token usage from a judge_pairs run.
- class olmo_tap.final_evals.elo.judge.JudgeConfig(judge_model: str, cache_dir: Path, max_tokens: int = 1024, poll_initial_seconds: float = 30.0, poll_max_seconds: float = 300.0, submit_max_retries: int = 3, api_key: str | None = None)[source]¶
Bases: object
Knobs for one judge_pairs call.
- class olmo_tap.final_evals.elo.judge.JudgeResult(judgments: list[Judgment], cache_stats: CacheStats, rubric: Rubric)[source]¶
Bases: object
Output of judge_pairs: reconciled judgments and run stats.
- cache_stats: CacheStats¶
- class olmo_tap.final_evals.elo.judge.Judgment(prompt_id: str, dimension: Literal['factuality', 'calibration', 'clinical_utility'], entrant_a: str, entrant_b: str, winner: str | None, inconsistent: bool, raw: tuple[_RawJudgement, _RawJudgement])[source]¶
Bases: object
Reconciled verdict for one (pair, prompt, dimension) triple. winner is entrant_a / entrant_b on a consistent verdict, None for a tie or inconsistent swap-pair (Elo treats these as ties). inconsistent is True only when the two position orders disagreed.
- class olmo_tap.final_evals.elo.judge.PairToJudge(prompt_id: str, source: str, prompt_text: str, entrant_a: str, entrant_b: str, response_a: str, response_b: str, gold_answer: str | None = None, expected_behavior: str | None = None)[source]¶
Bases: object
One pair of responses to be judged on a given prompt and rubric.
- class olmo_tap.final_evals.elo.judge.Rubric(dimension: Literal['factuality', 'calibration', 'clinical_utility'], version: str, text: str, path: Path)[source]¶
Bases: object
A loaded rubric: dimension, version header, and full file contents.
- olmo_tap.final_evals.elo.judge.build_system_message(rubric: Rubric) str[source]¶
Compose the cacheable system message: a fixed header plus rubric text.
The full rubric file (including its version header) is embedded so that any byte-level edit to the rubric invalidates the prompt cache, matching the cache-key behaviour. The header is intentionally substantive so the cached prefix exceeds the API’s minimum cacheable token count, making prompt caching economically meaningful within a batch.
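A batch request carrying such a cached system prefix might be assembled as below. This is a sketch assuming the request-dict shape of the Anthropic Messages Batch API (custom_id plus Messages params, with a cache_control block on the system text); the helper name and its arguments are illustrative, not this module's API.

```python
# Hypothetical builder for one batch request whose system prefix is a
# 1-hour prompt-cache breakpoint (dict shape assumed from the
# Anthropic Messages Batch API).
def make_batch_request(custom_id, system_text, user_text, model, max_tokens=1024):
    return {
        "custom_id": custom_id,  # e.g. the cache key, so results map back to queries
        "params": {
            "model": model,
            "max_tokens": max_tokens,
            "system": [
                {
                    "type": "text",
                    "text": system_text,  # fixed header + full rubric file
                    # 1-hour TTL breakpoint: every request in the batch
                    # reads this cached prefix instead of reprocessing it.
                    "cache_control": {"type": "ephemeral", "ttl": "1h"},
                }
            ],
            "messages": [{"role": "user", "content": user_text}],
        },
    }
```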
- olmo_tap.final_evals.elo.judge.build_user_message(query: _JudgeQuery) str[source]¶
Compose the per-pair user message. Not cached — varies every request.
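The real template is not reproduced on this page, but a minimal sketch can show the documented conditional fields: gold_answer appears for factuality prompts, and expected_behavior only for curated calibration prompts. The function name and layout here are assumptions.

```python
# Illustrative per-pair user message (hypothetical layout, not the
# module's actual template).
def make_user_message(prompt_text, response_a, response_b,
                      gold_answer=None, expected_behavior=None):
    parts = [f"Prompt:\n{prompt_text}"]
    if gold_answer is not None:
        # Present only on the factuality dimension.
        parts.append(f"Gold answer:\n{gold_answer}")
    if expected_behavior is not None:
        # Present only for curated prompts on the calibration dimension.
        parts.append(f"Expected behavior:\n{expected_behavior}")
    parts.append(f"Response A:\n{response_a}")
    parts.append(f"Response B:\n{response_b}")
    return "\n\n".join(parts)
```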
- olmo_tap.final_evals.elo.judge.cache_path_for(cache_dir: Path, dimension: Literal['factuality', 'calibration', 'clinical_utility']) Path[source]¶
- olmo_tap.final_evals.elo.judge.derive_cache_key(*, rubric_text: str, judge_model: str, prompt_id: str, entrant_a: str, entrant_b: str, position_swap: bool, response_a: str, response_b: str) str[source]¶
Return a 12-hex-char SHA256 prefix over the inputs.
The hash is taken over a JSON-encoded dict with sorted keys so the output is deterministic. Callers should pass the same values they would send to the judge — in particular, response_a is whatever text appears in position A before swapping. The position order is captured separately via position_swap.
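The documented scheme (sorted-key JSON, SHA-256, 12-hex-char prefix) can be reimplemented in a few lines. Field names below mirror the documented signature; the exact payload layout inside the hashed dict is an assumption.

```python
import hashlib
import json

# Sketch of the documented cache-key derivation: JSON-encode a dict with
# sorted keys for determinism, SHA-256 it, keep a 12-hex-char prefix.
def derive_cache_key_sketch(*, rubric_text, judge_model, prompt_id,
                            entrant_a, entrant_b, position_swap,
                            response_a, response_b):
    payload = json.dumps(
        {
            "rubric_text": rubric_text,      # any rubric edit changes the key
            "judge_model": judge_model,
            "prompt_id": prompt_id,
            "entrant_a": entrant_a,
            "entrant_b": entrant_b,
            "position_swap": position_swap,  # position order captured here
            "response_a": response_a,        # text in position A, pre-swap
            "response_b": response_b,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```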
- olmo_tap.final_evals.elo.judge.filter_pairs_for_dimension(pairs: Iterable[PairToJudge], dimension: Literal['factuality', 'calibration', 'clinical_utility']) list[PairToJudge][source]¶
Drop pairs that should not be judged on the given dimension.
The factuality rubric needs a gold answer, which curated trustworthiness prompts do not have. Calibration and clinical_utility evaluate framing and are run on the full bank.
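The filtering rule reduces to a small predicate. A sketch, assuming pair objects expose the documented gold_answer attribute (the function name here is illustrative):

```python
# Sketch of the documented dimension filter.
def filter_pairs_sketch(pairs, dimension):
    if dimension != "factuality":
        # calibration and clinical_utility run on the full prompt bank
        return list(pairs)
    # factuality needs a gold answer; curated prompts carry none
    return [p for p in pairs if p.gold_answer is not None]
```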
- olmo_tap.final_evals.elo.judge.judge_pairs(pairs: list[PairToJudge], rubric: Rubric, config: JudgeConfig, *, client: Any = None) JudgeResult[source]¶
Judge pairs on rubric and return reconciled judgments.
Loads the on-disk cache for the rubric’s dimension, filters out any pair-orderings already judged, submits the rest as one Anthropic batch with the rubric prefix marked as a 1-hour cache breakpoint, polls until the batch ends, writes new entries to the cache, and reconciles each pair’s two position orders into one Judgment.
client is optional — passing one in is convenient for tests.
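The polling schedule implied by JudgeConfig's knobs (poll_initial_seconds, poll_max_seconds) would look like the sketch below: exponential backoff capped at the maximum. The doubling factor is an assumption; the source only says "exponential backoff".

```python
# Illustrative backoff schedule: start at `initial` seconds, double each
# attempt, never exceed `maximum` (assumed doubling factor of 2).
def poll_delays(initial=30.0, maximum=300.0, attempts=6):
    return [min(initial * (2 ** i), maximum) for i in range(attempts)]
```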
- olmo_tap.final_evals.elo.judge.load_cache(cache_dir: Path, dimension: Literal['factuality', 'calibration', 'clinical_utility']) dict[str, dict[str, Any]][source]¶
Load the JSONL cache for one dimension into a cache_key -> record dict.
If the same cache_key appears twice (legacy / concurrent-write artefact), the later entry wins.
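The later-entry-wins behaviour falls out naturally from loading JSONL into a dict in file order. A minimal sketch, assuming each line is a JSON object with a "cache_key" field (the function name is illustrative):

```python
import json

# Sketch of the documented JSONL cache load: one JSON object per line,
# keyed by "cache_key"; a later duplicate overwrites the earlier record.
def load_cache_sketch(path):
    records = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            records[record["cache_key"]] = record  # later entry wins
    return records
```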