olmo_tap.final_evals.elo

Tournament 1 Elo evaluation harness.

Configuration-level Elo for four entrants (base OLMo, Hydra+Security, Hydra+Security+Robustness, Hydra+PoE) with Claude as an LLM judge.

Modules

`elo_engine`	Permutation-averaged Elo with K-factor sensitivity sweep.
`entrants`	Entrant definitions for the configuration-level Elo tournament.
`generate`	Entrant → response generation for the configuration-level Elo tournament.
`judge`	LLM-judge pipeline using the Anthropic Batch API with prompt caching.
`match_builder`	Pairwise match list construction for the configuration-level Elo run.
`prompts`	Prompt-bank construction for Tournament 1.
`report`	Tournament reporting helpers.
`run_tournament`	Tournament orchestrator: response cache → judges → Elo → artifacts.
`scripts`	One-off scripts for the Elo evaluation harness.
`types`	Shared, dependency-light data types for the Elo tournament pipeline.
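
As a rough illustration of what "permutation-averaged Elo with K-factor sensitivity sweep" can mean, the sketch below replays the same match list in many random orders, averages the final ratings, and repeats this for several K values. It is not taken from `elo_engine`: the function names, the `(entrant_a, entrant_b, score_a)` match tuple, the initial rating of 1000, the default K values, and the entrant labels are all assumptions made for this example.

```python
import random
from collections import defaultdict


def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def run_elo(matches, k=32.0, initial=1000.0):
    """Apply sequential Elo updates over matches in the given order.

    Each match is (entrant_a, entrant_b, score_a), with score_a being
    1.0 for an A win, 0.0 for a B win, and 0.5 for a tie (assumed format).
    """
    ratings = defaultdict(lambda: initial)
    for a, b, score_a in matches:
        delta = k * (score_a - expected_score(ratings[a], ratings[b]))
        ratings[a] += delta  # zero-sum update: A gains what B loses
        ratings[b] -= delta
    return dict(ratings)


def permutation_averaged_elo(matches, k=32.0, n_permutations=200,
                             seed=0, initial=1000.0):
    """Average final ratings over random match orderings.

    Sequential Elo depends on match order; averaging over shuffles
    reduces that dependence.
    """
    rng = random.Random(seed)
    order = list(matches)
    totals = defaultdict(float)
    for _ in range(n_permutations):
        rng.shuffle(order)
        for name, rating in run_elo(order, k, initial).items():
            totals[name] += rating
    return {name: total / n_permutations for name, total in totals.items()}


def k_sweep(matches, ks=(8.0, 16.0, 32.0), **kwargs):
    """Recompute permutation-averaged ratings for each K-factor."""
    return {k: permutation_averaged_elo(matches, k=k, **kwargs) for k in ks}


# Hypothetical judged matches between placeholder entrant labels.
example_matches = (
    [("hydra_poe", "base", 1.0)] * 5
    + [("hydra_security", "base", 1.0)] * 3
    + [("hydra_poe", "hydra_security", 0.5)] * 4
)
sweep = k_sweep(example_matches, n_permutations=50, seed=1)
```

If the entrant ranking is the same for every K in `sweep`, the ordering is probably not an artifact of the chosen K-factor; the seeded `random.Random` keeps the averages reproducible across runs.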