olmo_tap.final_evals.elo.generate
Entrant -> response generation for the configuration-level Elo tournament.
Each EntrantSpec is materialised on GPU via
build_entrant(), then fed the prompt bank one prompt at a time.
The vanilla-HF entrant uses greedy decoding directly through
model.generate; the Hydra entrants route through
PoE.generate_with_cache() with per-prompt seeding, so the random
draft-head selection lines up across the three Hydra entrants on each
prompt while still varying across prompts.
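The per-prompt seeding described above can be sketched as follows. This is an illustrative assumption, not the module's actual helper: the `prompt_seed` function and the use of a string prompt identifier are hypothetical, but the property they demonstrate (same seed for every entrant on a given prompt, different seeds across prompts) matches the behaviour described.

```python
import hashlib


def prompt_seed(prompt_id: str, base_seed: int = 0) -> int:
    """Deterministic seed derived from the prompt: identical for every
    entrant on the same prompt, different across prompts."""
    digest = hashlib.sha256(f"{base_seed}:{prompt_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % (2**31)


# Hypothetical usage before each Hydra generation call:
#   torch.manual_seed(prompt_seed(prompt.id))
#   response = poe.generate_with_cache(prompt.text, ...)
```

Deriving the seed from the prompt rather than the entrant is what makes the draft-head selection comparable across entrants on each prompt.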
Responses are persisted line-by-line to
caches/responses/responses_<entrant_id>.jsonl so partial runs are
resumable: re-running the script picks up only the cache misses.
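The append-and-resume cache scheme can be sketched as below. The record field names (`prompt_id`, `response`) are assumptions for illustration; the real JSONL schema may differ.

```python
import json
from pathlib import Path


def load_cached(cache_path: Path) -> dict[str, str]:
    # Recover previously generated responses, keyed by prompt id.
    # A missing file simply means no cache hits yet.
    cached: dict[str, str] = {}
    if cache_path.exists():
        with cache_path.open() as f:
            for line in f:
                if line.strip():
                    rec = json.loads(line)
                    cached[rec["prompt_id"]] = rec["response"]
    return cached


def append_response(cache_path: Path, prompt_id: str, response: str) -> None:
    # One JSON record per line, written as soon as it is generated, so
    # an interrupted run loses at most the in-flight prompt.
    with cache_path.open("a") as f:
        f.write(json.dumps({"prompt_id": prompt_id, "response": response}) + "\n")
```

A re-run calls load_cached first and only generates for prompts absent from the returned dict, which is what makes partial runs resumable.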
Usage:
pixi run -e cuda python -m olmo_tap.final_evals.elo.generate \
--bank olmo_tap/final_evals/elo/prompts/bank.jsonl \
--entrants base_olmo,security_only,security_plus_robustness,full_poe \
--cache-dir olmo_tap/final_evals/elo/caches/responses
Functions

generate_responses_for_entrant: Generate (or recover from cache) one response per prompt for an entrant.

run_generation: Drive generation across all entrants with shared model loads.
- olmo_tap.final_evals.elo.generate.generate_responses_for_entrant(spec: EntrantSpec, loaded: LoadedEntrant, prompts: list[Prompt], cache_dir: Path, max_new_tokens: int = 256) -> list[GeneratedResponse]
Generate (or recover from cache) one response per prompt for an entrant.
Cache misses are appended to the per-entrant JSONL immediately so a SIGINT / OOM mid-sweep loses at most the in-flight prompt; a re-run picks up where the file left off.
- olmo_tap.final_evals.elo.generate.run_generation(specs: list[EntrantSpec], prompts: list[Prompt], cache_dir: Path, max_new_tokens: int = 256) -> dict[str, list[GeneratedResponse]]
Drive generation across all entrants with shared model loads.
Entrants with the same (loader, rob_checkpoint) share one loaded model; only the eval-mode kwargs (bypass_jury, temperature) differ between them. Loads happen group-by-group, and GPU memory is explicitly released between groups, so the peak footprint is one model at a time.