olmo_tap.experiments.security.eval

Evaluate a model on MedMCQA classification accuracy.

Usage:

# Evaluate base OLMo (no finetuning)
pixi run python -m experiments.security.eval --base

# Evaluate a finetuned checkpoint
pixi run python -m experiments.security.eval --checkpoint path/to/checkpoint_final.pt

Functions

evaluate(model, tokenizer, dataset, ...)

Run classification eval, return accuracy metrics.

format_question(question, mcq_options)

Wrap a raw MedMCQA question with preamble.

get_mcq_logits(logits, token_ids)

Extract the logits for the MCQ option tokens.

main()

parse_args()

olmo_tap.experiments.security.eval.evaluate(model, tokenizer, dataset, token_ids: list[int], batch_size: int, max_seq_len: int, device: str) → dict [source]

Run classification eval, return accuracy metrics.
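A minimal sketch of the evaluation loop this signature implies. The interfaces assumed here (a Hugging Face-style tokenizer/model, dataset items as dicts with "prompt" and "label" keys) are illustrative, not the module's actual data layout:

```python
import torch

def evaluate(model, tokenizer, dataset, token_ids, batch_size, max_seq_len, device="cpu"):
    # Predict the option token with the highest final-position logit and
    # compare against the gold label index. Field names "prompt"/"label"
    # are assumptions for this sketch.
    correct = total = 0
    model.eval()
    with torch.no_grad():
        for start in range(0, len(dataset), batch_size):
            batch = dataset[start:start + batch_size]
            enc = tokenizer(
                [ex["prompt"] for ex in batch],
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=max_seq_len,
            ).to(device)
            logits = model(**enc).logits              # (B, T, vocab)
            option_logits = logits[:, -1, token_ids]  # (B, n_options)
            preds = option_logits.argmax(dim=-1)
            labels = torch.tensor([ex["label"] for ex in batch], device=device)
            correct += (preds == labels).sum().item()
            total += len(batch)
    return {"accuracy": correct / total, "n": total}
```

Note that with right-padded batches the final sequence position may be a pad token; the real implementation's batching and padding handling lives in the source.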

olmo_tap.experiments.security.eval.format_question(question: str, mcq_options: list[str]) → str [source]

Wrap a raw MedMCQA question with preamble.
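A sketch of what this wrapping might look like. The preamble wording and letter/option layout below are illustrative assumptions; the actual prompt text lives in the module source:

```python
def format_question(question: str, mcq_options: list[str]) -> str:
    # Preamble text here is hypothetical, not the module's real prompt.
    letters = "ABCD"
    lines = [
        "Answer the following multiple-choice medical question "
        "with a single letter.",
        "",
        question,
    ]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(mcq_options)]
    lines.append("Answer:")
    return "\n".join(lines)
```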

olmo_tap.experiments.security.eval.get_mcq_logits(logits: Tensor, token_ids: list[int]) → Tensor [source]

Extract the logits for the MCQ option tokens.
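Given the signature and how evaluate uses option token ids, a plausible sketch is selecting the final-position logits at the option token ids (the actual slicing logic is in the source):

```python
import torch

def get_mcq_logits(logits: torch.Tensor, token_ids: list[int]) -> torch.Tensor:
    # logits: (batch, seq_len, vocab). Keep only the last position's scores
    # for the answer-option tokens, yielding shape (batch, n_options).
    return logits[:, -1, token_ids]

# Tiny example: option token ids [7, 13, 21, 42], second option favored.
logits = torch.zeros(2, 5, 100)
logits[:, -1, 13] = 3.0
option_logits = get_mcq_logits(logits, [7, 13, 21, 42])  # shape (2, 4)
```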
olmo_tap.experiments.security.eval.main()[source]
olmo_tap.experiments.security.eval.parse_args() → Namespace [source]