olmo_tap.experiments.uncertainty.data

Data loading for uncertainty head supervised finetuning on MedMCQA.

Functions

encode_second_pass(tokenizer, pre, ans, ...)

Tokenize the second pass.

format_first_pass(question, mcq_options)

Wrap a raw MedMCQA question with preamble.

format_second_pass(pre, ans)

load_shard(config)

preprocess_example(example, tokenizer, ...)

Pre-tokenize the first pass and all four second-pass variants.

olmo_tap.experiments.uncertainty.data.encode_second_pass(tokenizer: TokenizersBackend | SentencePieceBackend, pre: str, ans: str, max_seq_len: int) → dict [source]

Tokenize the second pass.
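A minimal sketch of what a function with this signature might do, assuming the tokenizer backend exposes an ``encode(text) -> list[int]`` method (the actual backend API, prompt joining, and return keys are assumptions, not taken from the library):

```python
def encode_second_pass(tokenizer, pre: str, ans: str, max_seq_len: int) -> dict:
    # Hypothetical: join the first-pass prefix and candidate answer, then
    # tokenize and truncate to the context window.
    text = pre + ans
    ids = tokenizer.encode(text)[:max_seq_len]
    return {"input_ids": ids}

class FakeTokenizer:
    """Stand-in backend: one token id per character, for illustration only."""
    def encode(self, text: str) -> list[int]:
        return list(range(len(text)))

out = encode_second_pass(FakeTokenizer(), "Q: ...", " A", max_seq_len=5)
print(out["input_ids"])  # → [0, 1, 2, 3, 4]
```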

olmo_tap.experiments.uncertainty.data.format_first_pass(question: str, mcq_options: list[str]) → str [source]

Wrap a raw MedMCQA question with preamble.
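An illustrative sketch of the kind of prompt this builds; the exact preamble wording and option-letter layout are assumptions, not copied from the library:

```python
def format_first_pass(question: str, mcq_options: list[str]) -> str:
    # Hypothetical preamble: label the four MedMCQA options A-D and
    # end with an answer cue for the model to complete.
    letters = "ABCD"
    lines = [f"{letters[i]}. {opt}" for i, opt in enumerate(mcq_options)]
    return f"Question: {question}\n" + "\n".join(lines) + "\nAnswer:"

prompt = format_first_pass(
    "Which vitamin deficiency causes scurvy?",
    ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"],
)
print(prompt)
```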

olmo_tap.experiments.uncertainty.data.format_second_pass(pre: str, ans: str) → str [source]
olmo_tap.experiments.uncertainty.data.load_shard(config: ExperimentConfig) → tuple[DataLoader, int, int, int, int] [source]
olmo_tap.experiments.uncertainty.data.preprocess_example(example: dict[str, str], tokenizer: TokenizersBackend | SentencePieceBackend, max_seq_len: int, token_ids: list[int]) → dict [source]

Pre-tokenize the first pass and all four second-pass variants.

We don’t know which answer the model will pick until runtime, so we tokenize all 4 possibilities here. The training loop uses torch.where to select the right one after the first forward pass.
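The selection described above can be sketched in plain Python (the real training loop does this with torch.where on batched tensors; the names and data layout here are illustrative assumptions):

```python
# All four second-pass variants are tokenized up front, keyed by option
# letter; after the first forward pass, the variant matching the model's
# predicted option is selected.
OPTION_LETTERS = ["A", "B", "C", "D"]

def select_second_pass(variants: dict[str, list[int]], predicted: int) -> list[int]:
    """Pick the pre-tokenized variant for the predicted option index."""
    return variants[OPTION_LETTERS[predicted]]

# Toy token ids standing in for the four tokenized variants.
variants = {letter: [i, i + 1] for i, letter in enumerate(OPTION_LETTERS)}
print(select_second_pass(variants, 2))  # → [2, 3]  (variant for option "C")
```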