# Architecture

A one-screen map of how the pieces fit together.

![OhLMo with PoE Speculative Verification.](../_static/architecture.png)

*OhLMo with PoE Speculative Verification. A shared trunk feeds K parallel LoRA-tuned heads (nine inference heads and one dedicated uncertainty head), with a draft head proposing tokens that the remaining heads verify as a Product-of-Experts.*

## Layer by layer

**`olmo_tap/`** is the model-side core. `hydra.py` defines the `HydraTransformer`, a shared OLMo-2-7B trunk with K parallel LoRA-tuned heads. `inference/poe.py` composes the heads at decode time via Product-of-Experts Speculative Verification, where a draft head proposes tokens and the remaining heads verify them as a PoE jury. `experiments/` holds the three post-training pipelines (security on disjoint MedMCQA shards, KL-based robustness against AmpleGCG suffixes, and a per-answer uncertainty head with residual-stream injection). `benchmarks/` and `final_evals/` reproduce the decode-throughput and accuracy/calibration numbers from the report. LoRA adapter shards live in `weights/` as Git LFS objects.

**`kernel_entropy/`** measures semantic uncertainty over free-text generations using Kernel Language Entropy ([Nikitin et al. 2024](https://arxiv.org/abs/2405.20003)). It samples N responses from the Hydra+PoE generator, scores pairwise entailment with a ModernBERT NLI head, builds a similarity kernel, and returns a scalar Von Neumann entropy. The same NLI scorer is reused inside `app/backend/` for per-claim confidence (single-sample SelfCheckGPT-NLI).

**`app/`** is the user-facing surface. The backend is a FastAPI app deployed on Modal with managed GPUs; it serves Hydra+PoE inference, runs claim decomposition, and assembles the response payload with the three trust signals. The frontend is a React+Vite SPA on Cloudflare Pages that targets whichever backend URL `VITE_API_BASE` points at, so frontend-only contributors do not need a local GPU.
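The Product-of-Experts verification step described under `olmo_tap/` can be illustrated with a toy sketch. It assumes each verifier head emits a log-probability vector over a shared vocabulary, multiplies the head distributions (by summing log-probabilities), renormalizes, and accepts the draft token only when it matches the PoE argmax. The greedy acceptance rule, the function names `poe_log_probs` and `verify_draft_token`, and the toy numbers are assumptions made for illustration; they are not taken from `inference/poe.py`, whose actual acceptance criterion may differ.

```python
import math
from typing import List, Sequence


def poe_log_probs(head_log_probs: Sequence[Sequence[float]]) -> List[float]:
    """Combine per-head log-probabilities into a normalized Product-of-Experts.

    Hypothetical helper: each inner sequence is one verifier head's
    log-probabilities over the same vocabulary.
    """
    if not head_log_probs:
        raise ValueError("need at least one verifier head")
    vocab_size = len(head_log_probs[0])
    if any(len(lp) != vocab_size for lp in head_log_probs):
        raise ValueError("all heads must share one vocabulary size")

    # A product of probabilities is a sum of log-probabilities.
    summed = [sum(lp[t] for lp in head_log_probs) for t in range(vocab_size)]

    # Renormalize with log-sum-exp for numerical stability.
    peak = max(summed)
    if peak == -math.inf:
        raise ValueError("every token was vetoed by at least one head")
    log_z = peak + math.log(sum(math.exp(s - peak) for s in summed))
    return [s - log_z for s in summed]


def verify_draft_token(draft_token: int,
                       head_log_probs: Sequence[Sequence[float]]) -> bool:
    """Accept the draft token only if it is the PoE argmax (assumed greedy rule)."""
    combined = poe_log_probs(head_log_probs)
    best = max(range(len(combined)), key=combined.__getitem__)
    return best == draft_token


# Toy example: three verifier heads over a three-token vocabulary.
toy_heads = [
    [math.log(0.6), math.log(0.3), math.log(0.1)],
    [math.log(0.5), math.log(0.4), math.log(0.1)],
    [math.log(0.2), math.log(0.7), math.log(0.1)],
]
accepted = verify_draft_token(1, toy_heads)  # token 1 has the largest product
```

Because the heads are multiplied rather than averaged, a single head that assigns a token near-zero probability drives that token's combined probability toward zero, which is the intuition behind a jury in which one dissenting head can block a proposal.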
## Trust signals

- **Uncertainty** is split by query type. For MCQs it comes from the dedicated Hydra uncertainty head, trained against MCQ correctness with residual-stream injection from a frozen LLM head. For free-text answers it comes from KLE over resampled generations.
- **Security** is delivered by the disjoint-shard post-train across the nine LLM heads. PoE Speculative Verification at decode time turns this into a per-token one-honest-head guarantee, and the verifier ensemble's per-token predictive entropy drives an optional uncertainty heatmap in the UI.
- **Robustness** is delivered by a KL-based post-train against adversarial suffixes generated by AmpleGCG, plus a runtime probe that re-scores the response under a precomputed attack bank using the same ModernBERT NLI model.

## Deployment topology

- **Local**: `olmo_tap` and `kernel_entropy` run via `pixi run -e cuda ...`. Weights live in `$OLMO_WEIGHTS_DIR`, set in `.env`.
- **Hosted backend**: `app/backend/modal_app.py` builds its image from the same `pixi install -e cuda --locked` used locally, so Modal and dev share one source of truth for dependencies. The `tap-olmo-weights` Modal Volume holds the OLMo snapshot and the ModernBERT cache.
- **Hosted frontend**: Cloudflare Pages builds `app/frontend/` and serves the demo at [tap-al9.pages.dev](https://tap-al9.pages.dev/).

## Where to look next

- {doc}`olmo-tap`: package map for the model core.
- {doc}`kernel-entropy`: KLE pipeline usage.
- {doc}`app`: application stack and pipeline.
- {doc}`../api/index`: auto-generated API reference.