app.backend.modal_app¶
Modal deployment wrapper for the Hydra inference backend.
Two Modal entry points live here:
download_weights()(one-off) – populates thetap-olmo-weightsVolume with OLMo-2-7B weights and a warm HF cache for ModernBERT-NLI. Run once per weights bump viamodal run app/backend/modal_app.py::download_weights.HydraBackend(always-on, GPU-backed) – preloads Hydra in the@modal.enter()hook (so the FastAPI lifespan can skip the load) and servesapp.backend.server.appas an ASGI app.
The container image is built from the project’s pixi.lock so the
runtime CUDA/torch/flash-attn stack matches local dev exactly. The
CONDA_OVERRIDE_CUDA=12.4 env var is required because the build
container has no GPU and pixi’s __cuda virtual package check would
otherwise fail; runtime containers get a real GPU from Modal.