olmo_tap.experiments.security.eval
Evaluate a model on MedMCQA classification accuracy.
Usage:

```shell
# Evaluate base OLMo (no finetuning)
pixi run python -m experiments.security.eval --base

# Evaluate a finetuned checkpoint
pixi run python -m experiments.security.eval --checkpoint path/to/checkpoint_final.pt
```
Functions

| Function | Description |
| --- | --- |
| `evaluate` | Run classification eval, return accuracy metrics. |
| `format_question` | Wrap a raw MedMCQA question with preamble. |
- olmo_tap.experiments.security.eval.evaluate(model, tokenizer, dataset, token_ids: list[int], batch_size: int, max_seq_len: int, device: str) → dict

  Run classification eval, return accuracy metrics.
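The core of such a classification eval is scoring only the answer-letter tokens (the `token_ids` argument) at the model's final position and taking the argmax. A minimal sketch of that scoring logic, with all helper names and the toy logits being illustrative assumptions rather than the actual implementation:

```python
# Hypothetical sketch of the classification step (not the real evaluate()):
# restrict the last-position logits to the option tokens and argmax.

def classify_batch(logits_last: list[list[float]], token_ids: list[int]) -> list[int]:
    """Pick, per example, the option whose token has the highest logit.

    logits_last: per-example logits over the full vocab at the last position.
    token_ids: vocab ids of the option letters (e.g. for "A".."D").
    """
    preds = []
    for row in logits_last:
        # Score only the option tokens, ignoring the rest of the vocab.
        scores = [row[t] for t in token_ids]
        preds.append(scores.index(max(scores)))
    return preds

def accuracy(preds: list[int], labels: list[int]) -> float:
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Toy example: vocab of 10 tokens, option tokens at ids [2, 4, 6, 8].
logits = [[0.0] * 10, [0.0] * 10]
logits[0][4] = 3.0   # model favours option B (index 1)
logits[1][8] = 2.0   # model favours option D (index 3)
preds = classify_batch(logits, token_ids=[2, 4, 6, 8])
print(preds, accuracy(preds, labels=[1, 3]))  # → [1, 3] 1.0
```

Restricting the argmax to the option tokens keeps the eval robust to the model assigning mass to unrelated vocabulary items.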
- olmo_tap.experiments.security.eval.format_question(question: str, mcq_options: list[str]) → str

  Wrap a raw MedMCQA question with preamble.
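A sketch of what such a wrapper typically produces: a preamble, the question, lettered options, and an answer cue. The exact preamble wording and layout here are assumptions, not taken from the source:

```python
# Hypothetical sketch of format_question; wording and layout are assumed.

PREAMBLE = "Answer the following medical multiple-choice question with a single letter."

def format_question(question: str, mcq_options: list[str]) -> str:
    lines = [PREAMBLE, "", question]
    # Label options A, B, C, D in order.
    for letter, option in zip("ABCD", mcq_options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer:")
    return "\n".join(lines)

prompt = format_question(
    "Which vitamin deficiency causes scurvy?",
    ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"],
)
print(prompt)
```

Ending the prompt with "Answer:" makes the next token the model emits directly comparable against the option-letter `token_ids` used by `evaluate`.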