olmo_tap.final_evals.elo.generate

Entrant -> response generation for the configuration-level Elo tournament.

Each EntrantSpec is materialised on GPU via build_entrant(), then fed the prompt bank one prompt at a time. The vanilla-HF entrant uses greedy decoding directly through model.generate; the Hydra entrants route through PoE.generate_with_cache() with per-prompt seeding, so on any given prompt the random draft-head selection is identical across the three Hydra entrants while still varying from prompt to prompt.
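
A minimal sketch of that per-prompt seeding pattern, assuming a torch-backed PoE object; the seed offset, helper names, and generate_with_cache() keyword arguments below are illustrative, not the module's actual code:

import torch

def seeded_generate(poe, tokenizer, prompt_text, prompt_index, max_new_tokens=256):
    # Seed depends only on the prompt, not the entrant, so every Hydra entrant
    # draws the same draft heads on prompt i while different prompts still vary.
    torch.manual_seed(1234 + prompt_index)
    input_ids = tokenizer(prompt_text, return_tensors="pt").input_ids.to("cuda")
    # generate_with_cache() call shape is assumed here for illustration.
    output_ids = poe.generate_with_cache(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)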

Responses are persisted line-by-line to caches/responses/responses_<entrant_id>.jsonl so partial runs are resumable: re-running the script picks up only the cache misses.

Usage:

pixi run -e cuda python -m olmo_tap.final_evals.elo.generate \
    --bank olmo_tap/final_evals/elo/prompts/bank.jsonl \
    --entrants base_olmo,security_only,security_plus_robustness,full_poe \
    --cache-dir olmo_tap/final_evals/elo/caches/responses

Functions

generate_responses_for_entrant(spec, loaded, ...)

Generate (or recover from cache) one response per prompt for an entrant.

main([argv])

run_generation(specs, prompts, cache_dir[, ...])

Drive generation across all entrants with shared model loads.

olmo_tap.final_evals.elo.generate.generate_responses_for_entrant(spec: EntrantSpec, loaded: LoadedEntrant, prompts: list[Prompt], cache_dir: Path, max_new_tokens: int = 256) → list[GeneratedResponse][source]

Generate (or recover from cache) one response per prompt for an entrant.

Cache misses are appended to the per-entrant JSONL immediately so a SIGINT / OOM mid-sweep loses at most the in-flight prompt; a re-run picks up where the file left off.
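
A rough illustration of that append-on-miss behaviour; the record fields and helper names are assumptions, not the module's actual schema:

import json
from pathlib import Path

def generate_with_jsonl_cache(prompts, generate_fn, cache_path: Path):
    # Recover whatever a previous (possibly interrupted) run already wrote.
    cached = {}
    if cache_path.exists():
        with cache_path.open() as f:
            for line in f:
                rec = json.loads(line)
                cached[rec["prompt_id"]] = rec

    results = []
    with cache_path.open("a") as f:
        for prompt_id, text in prompts:
            if prompt_id in cached:
                results.append(cached[prompt_id])
                continue
            rec = {"prompt_id": prompt_id, "response": generate_fn(text)}
            # Write and flush immediately so an interrupt loses at most the
            # prompt currently in flight.
            f.write(json.dumps(rec) + "\n")
            f.flush()
            results.append(rec)
    return results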

olmo_tap.final_evals.elo.generate.main(argv: list[str] | None = None) → None[source]
olmo_tap.final_evals.elo.generate.run_generation(specs: list[EntrantSpec], prompts: list[Prompt], cache_dir: Path, max_new_tokens: int = 256) → dict[str, list[GeneratedResponse]][source]

Drive generation across all entrants with shared model loads.

Entrants with the same (loader, rob_checkpoint) share one loaded model — only the eval-mode kwargs (bypass_jury, temperature) differ between them. Loads happen group-by-group; GPU memory is explicitly released between groups so the peak footprint is one model at a time.
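
A sketch of that group-by-group loading discipline, assuming torch CUDA models and hypothetical load/generate helpers; the EntrantSpec field names are taken from the description above:

import gc
from itertools import groupby

import torch

def run_grouped(specs, load_fn, generate_fn):
    # Entrants sharing (loader, rob_checkpoint) reuse one loaded model.
    def group_key(spec):
        return (spec.loader, spec.rob_checkpoint)

    results = {}
    for shared_key, group in groupby(sorted(specs, key=group_key), key=group_key):
        loaded = load_fn(shared_key)  # one GPU load per group
        for spec in group:
            results[spec.entrant_id] = generate_fn(spec, loaded)
        # Drop the model and clear the CUDA cache so the next group's load
        # does not stack on top of this one.
        del loaded
        gc.collect()
        torch.cuda.empty_cache()
    return results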