# Application

The chat UI and FastAPI backend that wire the Hydra model and trust metrics together.

## Hosted architecture

The backend runs on Modal as a hosted FastAPI app on managed GPUs, and Cloudflare Pages hosts the frontend at `tap-al9.pages.dev`. Which backend the frontend hits is controlled by `VITE_API_BASE`, so frontend-only work doesn't need a local GPU or a running backend. See the top-level README for the Modal tasks and workspace details.

## Quick start

> [!TIP]
> Working on the frontend only? Set `VITE_API_BASE` in `app/frontend/.env` to the hosted Modal URL and skip the backend steps below.

### Prerequisites for the local backend

- OLMo 2 7B weights on disk, with `OLMO_WEIGHTS_DIR` in your environment pointing at them (e.g. `/vol/bitbucket/$USER/olmo-2-7b-instruct`).
- `HF_TOKEN` in your environment, used for the HF fallback path and for LLM-based claim decomposition.
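A quick preflight check can catch missing setup before the backend starts. This is a hypothetical helper, not part of the repo; it only assumes the two environment variables listed above:

```python
import os
from pathlib import Path

def preflight() -> list[str]:
    """Return a list of problems with the local-backend environment."""
    problems = []
    weights_dir = os.environ.get("OLMO_WEIGHTS_DIR")
    if not weights_dir:
        problems.append("OLMO_WEIGHTS_DIR is not set")
    elif not Path(weights_dir).is_dir():
        problems.append(f"OLMO_WEIGHTS_DIR does not exist: {weights_dir}")
    if not os.environ.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set (needed for HF fallback "
                        "and LLM-based claim decomposition)")
    return problems

if __name__ == "__main__":
    for problem in preflight():
        print("warning:", problem)
```

Run it once before `pixi run -e cuda app-api`; an empty output means both variables are in place.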

### Run it

```bash
# Start the backend using pixi
pixi run -e cuda app-api
# Serves at http://localhost:8000
```

```bash
# Start the frontend (separate terminal)
cp app/frontend/.env.example app/frontend/.env  # first time only
cd app/frontend
npm install
npm run dev
# Opens at http://localhost:5173
```

Click any of the example queries on the landing page to quickly test the flow.

## Pipeline

Our metrics are calculated on each response as follows:

- **Security.** Hydra generates the response with PoE verification, which gives us certified tokens and resampled alternatives for the security panel, plus the verifier ensemble's per-token predictive entropy that drives the experimental token-uncertainty heatmap.
- **Uncertainty.** The response-level signal is split by query type: for MCQs it comes from the Hydra uncertainty head (trained specifically for this task); for free-text answers we compute Kernel Language Entropy over resampled generations using a ModernBERT NLI scorer.
- **Robustness.** Driven by the robustness LoRA plus an adversarial suffix bank scored by the same NLI model. The suffixes are generated by AmpleGCG; for the inference pipeline we selected those with the highest probability of inducing a change in the response.
- **Routing and rendering.** MCQ vs. free-text routing is itself a real BERT classifier. Claim decomposition uses an LLM (FActScore-style) with an NLTK sentence-split fallback, and responses render as rich markdown in the UI.
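As a rough sketch of the free-text uncertainty branch: Kernel Language Entropy is the von Neumann entropy of a semantic-similarity kernel over resampled generations. The kernel entries below are placeholders for NLI-derived similarities (the real pipeline uses a ModernBERT NLI scorer, and its exact kernel construction may differ):

```python
import numpy as np

def kernel_language_entropy(K: np.ndarray) -> float:
    """Von Neumann entropy of a PSD semantic-similarity kernel.

    K[i, j] is a symmetric similarity between resampled generations
    i and j (e.g. averaged bidirectional NLI entailment scores).
    """
    rho = K / np.trace(K)               # normalize to unit trace
    eigvals = np.linalg.eigvalsh(rho)
    eigvals = eigvals[eigvals > 1e-12]  # drop numerical zeros
    return float(-np.sum(eigvals * np.log(eigvals)))

# Five resamples that all agree -> near-zero entropy
print(kernel_language_entropy(np.ones((5, 5))))
# Five mutually unrelated resamples -> entropy near log(5) ~ 1.609
print(kernel_language_entropy(np.eye(5)))
```

The intuition matches the panel's behavior: consistent resamples produce a low score, while semantically scattered resamples produce a high one.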

The per-claim confidence score shown inside each expanded claim reuses the same ModernBERT NLI scorer: each claim is scored as `P(entailment) + 0.5 * P(neutral)` with the response as premise and the claim as hypothesis, the single-sample degenerate case of SelfCheckGPT-NLI (Manakul et al., EMNLP 2023). A high score means the response actually supports the claim.
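The score itself is a one-liner over the NLI class probabilities. This sketch assumes a scorer that returns entailment/neutral/contradiction probabilities for a (premise, hypothesis) pair, which is outside the snippet:

```python
def claim_confidence(p_entail: float, p_neutral: float) -> float:
    """SelfCheckGPT-NLI style score: full credit for entailment,
    half credit for neutral, none for contradiction."""
    return p_entail + 0.5 * p_neutral

# Response clearly supports the claim
print(claim_confidence(0.90, 0.08))  # ~0.94
# NLI model is unsure either way
print(claim_confidence(0.10, 0.80))  # ~0.50
```

Contradiction probability contributes nothing, so a claim the response actively disputes scores near zero.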

If the Hydra path is unavailable (weights missing, or `hf=true` passed), the backend serves the HF fallback and returns null/empty payloads for security, uncertainty, and robustness so the UI can degrade gracefully.
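A sketch of that degraded mode (field names here are illustrative, not the actual API schema): the fallback answer still arrives, while the trust panels receive nothing to render:

```python
def fallback_payload(answer: str) -> dict:
    """Illustrative shape of an HF-fallback response: the answer text
    is present, but Hydra-derived trust metrics are null so the UI
    can hide or grey out those panels."""
    return {
        "answer": answer,
        "security": None,     # no certified tokens or resampled alternatives
        "uncertainty": None,  # no uncertainty-head or KLE score
        "robustness": None,   # no adversarial-suffix results
    }

payload = fallback_payload("Paris is the capital of France.")
print(payload["security"])  # None
```

The frontend only needs to null-check each metric field, so the same components serve both the full Hydra path and the fallback.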