app.backend.modal_app

Modal deployment wrapper for the Hydra inference backend.

Two Modal entry points live here:

  • download_weights() (one-off) – populates the tap-olmo-weights Volume with OLMo-2-7B weights and a warm HF cache for ModernBERT-NLI. Run once per weights bump via modal run app/backend/modal_app.py::download_weights.

  • HydraBackend (always-on, GPU-backed) – preloads Hydra in the @modal.enter() hook (so the FastAPI lifespan can skip the load) and serves app.backend.server.app as an ASGI app.

The container image is built from the project’s pixi.lock so the runtime CUDA/torch/flash-attn stack matches local dev exactly. The CONDA_OVERRIDE_CUDA=12.4 env var is required because the build container has no GPU and pixi’s __cuda virtual package check would otherwise fail; runtime containers get a real GPU from Modal.