olmo_tap.experiments.uncertainty.data

Data loading for uncertainty head supervised finetuning on MedMCQA.

Functions

encode_second_pass(tokenizer, pre, ans, ...)

Tokenize the second pass.

format_first_pass(question, mcq_options)

Wrap a raw MedMCQA question with preamble.

format_second_pass(pre, ans)

load_shard(config)

preprocess_example(example, tokenizer, ...)

Pre-tokenize the first pass and all four second-pass variants.

olmo_tap.experiments.uncertainty.data.encode_second_pass(tokenizer: TokenizersBackend | SentencePieceBackend, pre: str, ans: str, max_seq_len: int) → dict [source]

Tokenize the second pass.
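A minimal sketch of what a function with this signature might do, assuming the tokenizer backend exposes an ``encode(text) -> list[int]`` method (the actual backend API, prompt joining, and return keys are assumptions, not taken from the library):

```python
def encode_second_pass(tokenizer, pre: str, ans: str, max_seq_len: int) -> dict:
    # Hypothetical: join the first-pass prefix and candidate answer, then
    # tokenize and truncate to the context window.
    text = pre + ans
    ids = tokenizer.encode(text)[:max_seq_len]
    return {"input_ids": ids}

class FakeTokenizer:
    """Stand-in backend: one token id per character, for illustration only."""
    def encode(self, text: str) -> list[int]:
        return list(range(len(text)))

out = encode_second_pass(FakeTokenizer(), "Q: ...", " A", max_seq_len=5)
print(out["input_ids"])  # → [0, 1, 2, 3, 4]
```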

olmo_tap.experiments.uncertainty.data.format_first_pass(question: str, mcq_options: list[str]) → str [source]

Wrap a raw MedMCQA question with preamble.
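An illustrative sketch of the kind of prompt this builds; the exact preamble wording and option-letter layout are assumptions, not copied from the library:

```python
def format_first_pass(question: str, mcq_options: list[str]) -> str:
    # Hypothetical preamble: label the four MedMCQA options A-D and
    # end with an answer cue for the model to complete.
    letters = "ABCD"
    lines = [f"{letters[i]}. {opt}" for i, opt in enumerate(mcq_options)]
    return f"Question: {question}\n" + "\n".join(lines) + "\nAnswer:"

prompt = format_first_pass(
    "Which vitamin deficiency causes scurvy?",
    ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"],
)
print(prompt)
```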

olmo_tap.experiments.uncertainty.data.format_second_pass(pre: str, ans: str) → str [source]
olmo_tap.experiments.uncertainty.data.load_shard(config: ExperimentConfig) → tuple[DataLoader, int, int, int, int] [source]
olmo_tap.experiments.uncertainty.data.preprocess_example(example: dict[str, str], tokenizer: TokenizersBackend | SentencePieceBackend, max_seq_len: int, token_ids: list[int]) → dict [source]

Pre-tokenize the first pass and all four second-pass variants.

We don’t know which answer the model will pick until runtime, so we tokenize all 4 possibilities here. The training loop uses torch.where to select the right one after the first forward pass.
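The selection described above can be sketched in plain Python (the real training loop does this with torch.where on batched tensors; the names and data layout here are illustrative assumptions):

```python
# All four second-pass variants are tokenized up front, keyed by option
# letter; after the first forward pass, the variant matching the model's
# predicted option is selected.
OPTION_LETTERS = ["A", "B", "C", "D"]

def select_second_pass(variants: dict[str, list[int]], predicted: int) -> list[int]:
    """Pick the pre-tokenized variant for the predicted option index."""
    return variants[OPTION_LETTERS[predicted]]

# Toy token ids standing in for the four tokenized variants.
variants = {letter: [i, i + 1] for i, letter in enumerate(OPTION_LETTERS)}
print(select_second_pass(variants, 2))  # → [2, 3]  (variant for option "C")
```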