olmo_tap.final_evals.elo

Tournament 1 Elo evaluation harness.

Configuration-level Elo for four entrants (base OLMo, Hydra+Security, Hydra+Security+Robustness, Hydra+PoE) with Claude as an LLM judge.

Modules

elo_engine

Permutation-averaged Elo with K-factor sensitivity sweep.

entrants

Entrant definitions for the configuration-level Elo tournament.

generate

Entrant -> response generation for the configuration-level Elo tournament.

judge

LLM-judge pipeline using the Anthropic Batch API with prompt caching.

match_builder

Pairwise match list construction for the configuration-level Elo run.

prompts

Prompt-bank construction for Tournament 1.

report

Tournament reporting helpers.

run_tournament

Tournament orchestrator: response cache → judges → Elo → artifacts.

scripts

One-off scripts for the Elo evaluation harness.

types

Shared, dependency-light data types for the Elo tournament pipeline.
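
Example

Sequential Elo depends on the order in which matches are processed, so the elo_engine summary above ("permutation-averaged Elo with K-factor sensitivity sweep") can be read as averaging ratings over many shuffled match orders and repeating that for several K values. The following is a minimal sketch of that idea, not the module's actual code: the function names, the initial rating of 1000, the K values, the permutation count, and the entrant labels are all assumptions chosen for illustration.

```python
import random


def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def run_elo(matches, k, initial=1000.0):
    """One sequential Elo pass. Each match is (a, b, score_a), score_a in [0, 1]."""
    ratings = {}
    for a, b, score_a in matches:
        r_a = ratings.setdefault(a, initial)
        r_b = ratings.setdefault(b, initial)
        e_a = expected_score(r_a, r_b)
        # Zero-sum update: whatever A gains, B loses.
        ratings[a] = r_a + k * (score_a - e_a)
        ratings[b] = r_b - k * (score_a - e_a)
    return ratings


def permutation_averaged_elo(matches, k=16.0, n_permutations=200, seed=0):
    """Average final ratings over shuffled match orders (hypothetical helper)."""
    rng = random.Random(seed)
    order = list(matches)
    totals = {}
    for _ in range(n_permutations):
        rng.shuffle(order)
        for name, rating in run_elo(order, k).items():
            totals[name] = totals.get(name, 0.0) + rating
    return {name: total / n_permutations for name, total in totals.items()}


def k_sweep(matches, ks=(8.0, 16.0, 32.0)):
    """Repeat the permutation-averaged rating for each K (assumed values)."""
    return {k: permutation_averaged_elo(matches, k=k) for k in ks}


if __name__ == "__main__":
    # Hypothetical judged outcomes: "hydra" wins 5 of 6 matches against "base".
    demo = [("base", "hydra", 0.0)] * 5 + [("base", "hydra", 1.0)]
    for k, ratings in k_sweep(demo).items():
        print(k, {name: round(r, 1) for name, r in ratings.items()})
```

If the ordering of entrants is the same at every K in the sweep, the ranking is not an artifact of one particular K choice; the spread of ratings across K indicates how sensitive the absolute numbers are.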