olmo_tap.experiments.uncertainty.data¶
Data loading for supervised finetuning of the uncertainty head on MedMCQA.
Functions
encode_second_pass | Tokenize the second pass.
format_first_pass | Wrap a raw MedMCQA question with a preamble.
load_shard |
preprocess_example | Pre-tokenize the first pass and all 4 second-pass variants.
- olmo_tap.experiments.uncertainty.data.encode_second_pass(tokenizer: TokenizersBackend | SentencePieceBackend, pre: str, ans: str, max_seq_len: int) → dict[source]¶
Tokenize the second pass.
- olmo_tap.experiments.uncertainty.data.format_first_pass(question: str, mcq_options: list[str]) → str[source]¶
Wrap a raw MedMCQA question with a preamble.
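A hedged usage sketch: the example question is illustrative, and the exact preamble text the module prepends is not documented here, so only the shape of the call is shown.

from olmo_tap.experiments.uncertainty.data import format_first_pass

# Illustrative MedMCQA-style question; the real dataset supplies these fields.
prompt = format_first_pass(
    question="Which vitamin deficiency causes scurvy?",
    mcq_options=["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"],
)
# prompt is the raw question wrapped with the module's preamble plus the
# four options, ready to tokenize for the first forward pass.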
- olmo_tap.experiments.uncertainty.data.load_shard(config: ExperimentConfig) → tuple[DataLoader, int, int, int, int][source]¶
- olmo_tap.experiments.uncertainty.data.preprocess_example(example: dict[str, str], tokenizer: TokenizersBackend | SentencePieceBackend, max_seq_len: int, token_ids: list[int]) → dict[source]¶
Pre-tokenize the first pass and all 4 second-pass variants.
We don’t know which answer the model will pick until runtime, so we tokenize all four possibilities here. The training loop uses torch.where to select the right one after the first forward pass.
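For illustration, a minimal sketch of the selection step the docstring describes. The tensor names and shapes are assumptions, not the actual training-loop code:

import torch

# Assumed layout: pre-tokenized ids for all 4 second-pass variants,
# shape (batch, 4, seq), as produced per-example by preprocess_example.
batch, seq = 2, 8
second_pass_ids = torch.randint(1, 100, (batch, 4, seq))

# Option index each example's model picked after the first forward pass.
picked = torch.tensor([2, 0])

# Build a boolean one-hot mask over the option dimension, then use
# torch.where to zero out every variant except the picked one; summing
# over the option dimension collapses to a single sequence per example.
mask = torch.nn.functional.one_hot(picked, num_classes=4).bool()
selected = torch.where(
    mask[:, :, None], second_pass_ids, torch.zeros_like(second_pass_ids)
).sum(dim=1)
# selected[i] == second_pass_ids[i, picked[i]] for every i in the batch.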