olmo_tap.benchmarks.plotting

Plot the three-config benchmark output produced by olmo_tap.benchmarks.inference.

A single figure with three subplots:

  1. TTFT distribution — KDE + histogram of prefill (time-to-first-token) timings for baseline and hydra_naive. PoE is excluded because prefill is bundled inside its full-generation call.

  2. Decode latency vs KV position — per-step decode latency (ms) over the benchmarked KV positions. PoE is rendered as horizontal dashed lines, one per γ, at its per-token equivalent latency (call median / accepted tokens) so it sits on the same axis as the per-step rows; see the sketch after this list.

  3. TPS vs KV position — tokens per second over the same KV positions, mirroring the layout of subplot 2. PoE is rendered as horizontal dashed lines at its effective TPS per γ.
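
A small sketch of the per-token conversion behind the PoE dashed lines in subplots 2 and 3. The sample values are made up, and reading "call median / accepted tokens" as a median over call times divided by a median over accepted-token counts is an assumption on top of the stated formula:

    import statistics

    # Hypothetical PoE measurements for one gamma: wall-clock time (ms) of
    # each full-generation call, and how many tokens each call accepted.
    call_times_ms = [812.0, 790.5, 805.2, 798.8]
    accepted_tokens = [38, 36, 37, 38]

    # Per-token equivalent latency: median call time divided by median
    # accepted-token count, so the dashed line is directly comparable to
    # the per-step decode rows on the latency axis.
    per_token_latency_ms = (statistics.median(call_times_ms)
                            / statistics.median(accepted_tokens))

    # The matching dashed line on the TPS axis is the reciprocal in seconds.
    effective_tps = 1000.0 / per_token_latency_ms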

Colour convention: orange = baseline, blue = naive Hydra, green family = PoE (darker greens for larger γ).
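
One way to realise that palette with matplotlib; the poe_color helper and the exact colormap sampling are illustrative, not the module's actual code:

    import matplotlib.cm as cm

    BASELINE_COLOR = "tab:orange"    # orange = baseline
    HYDRA_NAIVE_COLOR = "tab:blue"   # blue = naive Hydra

    def poe_color(gamma, all_gammas):
        """Sample the Greens colormap so larger gamma means darker green.

        Hypothetical helper; the module's real colour choices may differ.
        """
        rank = sorted(all_gammas).index(gamma)
        # Skip the near-white low end of the colormap for legibility.
        shade = 0.4 + 0.5 * rank / max(len(all_gammas) - 1, 1)
        return cm.Greens(shade)

    # Example: gammas 2, 4, 8 map to progressively darker greens.
    poe_colors = {g: poe_color(g, [2, 4, 8]) for g in (2, 4, 8)}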

Usually invoked indirectly — olmo_tap.benchmarks.inference.main() calls plot_results() after writing results.json.
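
To re-plot an existing run without re-running the benchmark, plot_results() can also be called directly; the run directory below is a hypothetical path:

    import json
    from pathlib import Path

    from olmo_tap.benchmarks.plotting import plot_results

    run_dir = Path("benchmark_runs/latest")  # hypothetical run directory

    # Load the results.json written by olmo_tap.benchmarks.inference.main()
    # and render graph.png into the same directory.
    results = json.loads((run_dir / "results.json").read_text())
    plot_results(results, run_dir)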

Functions

plot_decode_curve(ax_latency, ax_tps, ...)

Draw one config's per-step decode-latency and TPS curves onto the paired axes.

plot_histogram_kde(ax, timings, label, color)

Draw a labelled histogram with a KDE overlay of TTFT timings onto ax.

plot_results(results, output_dir)

Render the three-config comparison figure to graph.png.
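
Taken together with the overview, the helpers suggest the composition sketched below. The one-row layout, figure size, and TTFT sample values are assumptions; plot_results() owns the real arrangement:

    import matplotlib.pyplot as plt

    from olmo_tap.benchmarks.plotting import plot_histogram_kde

    # Assumed panel order: TTFT distribution, decode latency, TPS.
    fig, (ax_ttft, ax_latency, ax_tps) = plt.subplots(1, 3, figsize=(15, 4))

    # Hypothetical TTFT samples; real timings come from the benchmark run.
    plot_histogram_kde(ax_ttft, [0.21, 0.19, 0.24, 0.22],
                       label="baseline", color="tab:orange")

    # plot_decode_curve(ax_latency, ax_tps, decode_results, label, color)
    # would fill the remaining two axes from a config's decode results
    # (schema defined by olmo_tap.benchmarks.inference).
    fig.savefig("graph.png")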

olmo_tap.benchmarks.plotting.plot_decode_curve(ax_latency, ax_tps, decode_results, label, color)[source]

Draw one config's per-step decode-latency and TPS curves onto the paired axes.

olmo_tap.benchmarks.plotting.plot_histogram_kde(ax, timings, label, color)[source]

Draw a labelled histogram with a KDE overlay of TTFT timings onto ax.
olmo_tap.benchmarks.plotting.plot_results(results, output_dir)[source]

Render the three-config comparison figure to graph.png.

Parameters:
  • results – Benchmark results dict as written to results.json by olmo_tap.benchmarks.inference.main(). Recognised top-level keys: baseline, hydra_naive, hydra_poe. Missing keys are skipped silently — a partial run still plots; a sketch of this skip behaviour follows the parameter list.

  • output_dir – Directory to write graph.png into. Must exist.
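
A runnable sketch of the documented skip behaviour (not the module's actual source); present_configs is a hypothetical name:

    def present_configs(results):
        """Keep only the recognised top-level configs present in results.

        Mirrors the documented behaviour: missing keys are dropped without
        warning, so a partial benchmark run still yields a partial figure.
        """
        recognised = ("baseline", "hydra_naive", "hydra_poe")
        return {k: results[k] for k in recognised if k in results}

    # A partial run that never benchmarked PoE still plots the other two.
    configs = present_configs({"baseline": {}, "hydra_naive": {}})
    assert set(configs) == {"baseline", "hydra_naive"}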