olmo_tap.experiments.uncertainty.single_head_eval¶
NOTE: this file is for testing the uncertainty head on a single LLM head. For the
equivalent file used for testing on the PoE Hydra aggregation, see olmo_tap/final_evals/uncertainty_sweep.py
Reliability-diagram eval for the uncertainty head.
For each robustness shard (0 through 8), run the uncertainty head over the MedMCQA validation fold via the two-pass procedure from engine.py::train, bin the predicted Q into equal-width bins, compute the empirical accuracy P per bin, and plot P vs Q with the y=x diagonal. Drops one PNG per shard.
- Intended Usage::
- pixi run -e cuda python -m olmo_tap.experiments.uncertainty.single_head_eval
–checkpoint olmo_tap/weights/uncertainty/checkpoint_final.pt
Functions
|
|
|
Mirror engine.py::train lines 61-126 under no_grad. |
|
|
|
MedMCQA validation fold with the same two-pass tokenization as training. |
|
|
|
- olmo_tap.experiments.uncertainty.single_head_eval.collect_predictions_for_shard(model: HydraTransformer, dataloader: DataLoader, target_token_ids: Tensor, t_config: TrainingConfig, device: str) tuple[Tensor, Tensor, float][source]¶
Mirror engine.py::train lines 61-126 under no_grad. Canonical ref: engine.py.
- olmo_tap.experiments.uncertainty.single_head_eval.get_letter_token_ids(tokenizer) list[int][source]¶
- olmo_tap.experiments.uncertainty.single_head_eval.load_validation_set(exp_config: ExperimentConfig, max_examples: int | None) tuple[DataLoader, list[int]][source]¶
MedMCQA validation fold with the same two-pass tokenization as training.