olmo_tap.final_evals.elo
Tournament 1 Elo evaluation harness.
Configuration-level Elo for four entrants (base OLMo, Hydra+Security, Hydra+Security+Robustness, Hydra+PoE) with Claude as an LLM judge.
Modules
Permutation-averaged Elo with K-factor sensitivity sweep.
Entrant definitions for the configuration-level Elo tournament.
Entrant -> response generation for the configuration-level Elo tournament.
LLM-judge pipeline using the Anthropic Batch API with prompt caching.
Pairwise match list construction for the configuration-level Elo run.
Prompt-bank construction for Tournament 1.
Tournament reporting helpers.
Tournament orchestrator: response cache → judges → Elo → artifacts.
One-off scripts for the Elo evaluation harness.
Shared, dependency-light data types for the Elo tournament pipeline.
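
Sequential Elo updates depend on the order in which matches are processed, which is presumably why the first module above averages over permutations and sweeps the K-factor. The following is a minimal sketch of that general technique, not the package's actual implementation: the function names, the initial rating of 1000, the K values, and the entrant labels are all hypothetical and chosen only for illustration.

```python
import random


def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def run_elo(matches, k, initial=1000.0):
    """Process matches in the given order.

    Each match is (entrant_a, entrant_b, score_a), where score_a is
    1.0 for an A win, 0.0 for a B win, and 0.5 for a tie.
    """
    ratings = {}
    for a, b, score_a in matches:
        r_a = ratings.setdefault(a, initial)
        r_b = ratings.setdefault(b, initial)
        delta = k * (score_a - expected_score(r_a, r_b))
        ratings[a] = r_a + delta  # A gains what B loses, so updates are zero-sum
        ratings[b] = r_b - delta
    return ratings


def permutation_averaged_elo(matches, k, n_permutations=200, seed=0):
    """Average final ratings over random orderings of the same match list."""
    rng = random.Random(seed)
    entrants = sorted({e for a, b, _ in matches for e in (a, b)})
    totals = {e: 0.0 for e in entrants}
    order = list(matches)
    for _ in range(n_permutations):
        rng.shuffle(order)
        ratings = run_elo(order, k)
        for e in entrants:
            totals[e] += ratings[e]
    return {e: totals[e] / n_permutations for e in entrants}


def k_factor_sweep(matches, k_values=(8, 16, 32)):
    """Repeat the permutation-averaged run for several K-factors."""
    return {k: permutation_averaged_elo(matches, k) for k in k_values}


# Hypothetical judged matches: entrant_a always wins, entrant_c always loses.
demo_matches = [
    ("entrant_a", "entrant_b", 1.0),
    ("entrant_b", "entrant_c", 1.0),
    ("entrant_a", "entrant_c", 1.0),
] * 5

sweep = k_factor_sweep(demo_matches)
for k, ratings in sweep.items():
    print(k, {e: round(r, 1) for e, r in ratings.items()})
```

If the ranking of entrants is the same for every K in the sweep, the result is unlikely to be an artifact of the chosen K-factor. Fixing the seed keeps the averaged ratings reproducible between runs.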