olmo_tap.benchmarks.plotting

Plot the three-config benchmark output produced by olmo_tap.benchmarks.inference.

A single figure with three subplots:

  1. TTFT distribution — KDE + histogram of prefill (time-to-first-token) timings for baseline and hydra_naive. PoE is excluded because prefill is bundled inside its full-generation call.

  2. Decode latency vs KV position — per-step decode latency (ms) over the benchmarked KV positions. PoE is rendered as horizontal dashed lines, one per γ, at its per-token equivalent latency (call median / accepted tokens) so it sits on the same axis as the per-step rows; see the sketch after this list.

  3. TPS vs KV position — tokens per second over the same KV positions, mirroring the layout of subplot 2. PoE is rendered as horizontal dashed lines at its effective TPS per γ.
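
A small sketch of the per-token conversion behind the PoE dashed lines in subplots 2 and 3. The sample values are made up, and reading "call median / accepted tokens" as a median over call times divided by a median over accepted-token counts is an assumption on top of the stated formula:

    import statistics

    # Hypothetical PoE measurements for one gamma: wall-clock time (ms) of
    # each full-generation call, and how many tokens each call accepted.
    call_times_ms = [812.0, 790.5, 805.2, 798.8]
    accepted_tokens = [38, 36, 37, 38]

    # Per-token equivalent latency: median call time divided by median
    # accepted-token count, so the dashed line is directly comparable to
    # the per-step decode rows on the latency axis.
    per_token_latency_ms = (statistics.median(call_times_ms)
                            / statistics.median(accepted_tokens))

    # The matching dashed line on the TPS axis is the reciprocal in seconds.
    effective_tps = 1000.0 / per_token_latency_ms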

Colour convention: orange = baseline, blue = naive Hydra, green family = PoE (darker greens for larger γ).
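
One way to realise that palette with matplotlib; the poe_color helper and the exact colormap sampling are illustrative, not the module's actual code:

    import matplotlib.cm as cm

    BASELINE_COLOR = "tab:orange"    # orange = baseline
    HYDRA_NAIVE_COLOR = "tab:blue"   # blue = naive Hydra

    def poe_color(gamma, all_gammas):
        """Sample the Greens colormap so larger gamma means darker green.

        Hypothetical helper; the module's real colour choices may differ.
        """
        rank = sorted(all_gammas).index(gamma)
        # Skip the near-white low end of the colormap for legibility.
        shade = 0.4 + 0.5 * rank / max(len(all_gammas) - 1, 1)
        return cm.Greens(shade)

    # Example: gammas 2, 4, 8 map to progressively darker greens.
    poe_colors = {g: poe_color(g, [2, 4, 8]) for g in (2, 4, 8)}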

Usually invoked indirectly — olmo_tap.benchmarks.inference.main() calls plot_results() after writing results.json.
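
To re-plot an existing run without re-running the benchmark, plot_results() can also be called directly; the run directory below is a hypothetical path:

    import json
    from pathlib import Path

    from olmo_tap.benchmarks.plotting import plot_results

    run_dir = Path("benchmark_runs/latest")  # hypothetical run directory

    # Load the results.json written by olmo_tap.benchmarks.inference.main()
    # and render graph.png into the same directory.
    results = json.loads((run_dir / "results.json").read_text())
    plot_results(results, run_dir)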

Functions

plot_decode_curve(ax_latency, ax_tps, ...)

Draw one config's per-step decode-latency and TPS curves onto the paired axes.

plot_histogram_kde(ax, timings, label, color)

Draw a labelled histogram with a KDE overlay of TTFT timings onto ax.

plot_results(results, output_dir)

Render the three-config comparison figure to graph.png.
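
Taken together with the overview, the helpers suggest the composition sketched below. The one-row layout, figure size, and TTFT sample values are assumptions; plot_results() owns the real arrangement:

    import matplotlib.pyplot as plt

    from olmo_tap.benchmarks.plotting import plot_histogram_kde

    # Assumed panel order: TTFT distribution, decode latency, TPS.
    fig, (ax_ttft, ax_latency, ax_tps) = plt.subplots(1, 3, figsize=(15, 4))

    # Hypothetical TTFT samples; real timings come from the benchmark run.
    plot_histogram_kde(ax_ttft, [0.21, 0.19, 0.24, 0.22],
                       label="baseline", color="tab:orange")

    # plot_decode_curve(ax_latency, ax_tps, decode_results, label, color)
    # would fill the remaining two axes from a config's decode results
    # (schema defined by olmo_tap.benchmarks.inference).
    fig.savefig("graph.png")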

olmo_tap.benchmarks.plotting.plot_decode_curve(ax_latency, ax_tps, decode_results, label, color)[source]

Draw one config's per-step decode-latency and TPS curves onto the paired axes.

olmo_tap.benchmarks.plotting.plot_histogram_kde(ax, timings, label, color)[source]

Draw a labelled histogram with a KDE overlay of TTFT timings onto ax.
olmo_tap.benchmarks.plotting.plot_results(results, output_dir)[source]

Render the three-config comparison figure to graph.png.

Parameters:
  • results – Benchmark results dict as written to results.json by olmo_tap.benchmarks.inference.main(). Recognised top-level keys: baseline, hydra_naive, hydra_poe. Missing keys are skipped silently — a partial run still plots; a sketch of this skip behaviour follows the parameter list.

  • output_dir – Directory to write graph.png into. Must exist.
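
A runnable sketch of the documented skip behaviour (not the module's actual source); present_configs is a hypothetical name:

    def present_configs(results):
        """Keep only the recognised top-level configs present in results.

        Mirrors the documented behaviour: missing keys are dropped without
        warning, so a partial benchmark run still yields a partial figure.
        """
        recognised = ("baseline", "hydra_naive", "hydra_poe")
        return {k: results[k] for k in recognised if k in results}

    # A partial run that never benchmarked PoE still plots the other two.
    configs = present_configs({"baseline": {}, "hydra_naive": {}})
    assert set(configs) == {"baseline", "hydra_naive"}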