Add configurable answer LLM generation to service mode by charlesbluca · Pull Request #2202 · NVIDIA/NeMo-Retriever

charlesbluca · 2026-06-02T18:24:22Z

Description

Adds configurable answer-generation LLM support to the retriever service and Helm chart. The optional operator-managed nimOperator.answer_llm block defaults to the Super-49B NIM, but its image, model id, model profile, resources, service name, and env can be overridden for other compatible LLM NIMs such as Nemotron/Nano 3 30B. The same service can also point at an external OpenAI-compatible endpoint through serviceConfig.llm.

Changes include:

POST /v1/answer in the service: VectorDB retrieval followed by LiteLLM answer generation, with lazy client caching, 404/502 error handling, and independent include_chunks / include_metadata response flags.
Generic Helm answer_llm NIMCache/NIMService rendering, including pull-secret validation, Super-49B defaults, Nano override coverage, and automatic in-cluster api_base / model wiring.
Secret-safe LLM auth: ConfigMap renders api_key: null; the runtime API key is injected from NEMO_RETRIEVER_LLM_API_KEY via either explicit serviceConfig.llm.apiKeySecret or the operator NIM auth secret.
Unified reasoning control: reasoning_enabled=False applies the shared no-reasoning request metadata/prompt controls used by the Super-49B and Nano NIMs, while generic library clients default to preserving provider behavior.
Stable litellm>=1.86.0,<2 service dependency plus focused unit/Helm tests for service config, answer generation, reasoning, packaging, and model swaps.

Validation:

uv run --project nemo_retriever pytest -q nemo_retriever/tests/test_llm_params.py nemo_retriever/tests/test_service_answer_generation.py nemo_retriever/tests/test_helm_answer_llm_generation.py nemo_retriever/tests/test_live_rag.py nemo_retriever/tests/test_service_pipeline_spec.py nemo_retriever/tests/test_service_packaging.py
uv run pre-commit run --all-files
PR CI green: pre-commit, retriever unit tests, slim import contract, library-mode installs, Docker build/test, and PR validation.
Manual smoke covered Helm answer generation and standalone Docker comparison of Super-49B/Nano reasoning controls.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

greptile-apps · 2026-06-02T18:46:54Z

Greptile Summary

This PR adds configurable answer-generation LLM support to the retriever service, wiring POST /v1/answer (VectorDB retrieval + LiteLLM generation) through a new LLMConfig, a Helm answer_llm NIMCache/NIMService template, and secret-safe API key injection via NEMO_RETRIEVER_LLM_API_KEY.

POST /v1/answer in ingest.py: lazy-cached LiteLLMClient, VectorDB proxy with 502 surfacing on failure, 404 guards for disabled VectorDB/LLM and worker pods, and per-request reasoning_enabled override.
litellm.py reasoning controls: reasoning_enabled now defaults True (preserving provider behaviour); chat_template_kwargs and /no_think directives are only injected when explicitly set to False, fixing the previously flagged default-breaking-non-Nemotron issue.
Helm wiring: answer-llm.yaml NIMCache/NIMService, configmap.yaml auto-wires operator URL/model when nimOperator.answer_llm.enabled: true, deployment.yaml injects NEMO_RETRIEVER_LLM_API_KEY from a Secret reference; api_key: null is rendered in the ConfigMap, addressing the prior plaintext-key concern.

Confidence Score: 4/5

Safe to merge after addressing the empty-chunk passthrough and missing query length bound; the core flow (secret injection, operator URL gating, reasoning defaults) is correct.

The LLM API key is properly kept out of the ConfigMap and injected via Secret-backed env vars. The nemo-retriever.nimOperator.url helper correctly gates on enabled so no phantom URL is generated when the answer LLM is disabled. The reasoning_enabled default flip to True fixes the previously flagged provider-compatibility issue. Two issues remain: _text_from_hit can silently produce empty-string chunks that bypass the no-context fallback and mislead the LLM, and AnswerRequest.query carries no upper-length bound, leaving the VectorDB embedding call and LLM context window unguarded against abnormally large inputs.

nemo_retriever/src/nemo_retriever/service/routers/ingest.py — the _text_from_hit helper and AnswerRequest model definition.

Important Files Changed

Filename	Overview
nemo_retriever/src/nemo_retriever/service/routers/ingest.py	Adds POST /v1/answer endpoint with lazy-cached LiteLLMClient, VectorDB proxy, mode guard, and proper 502/404 error surfacing; minor concerns around empty-chunk passthrough and missing query length validation.
nemo_retriever/src/nemo_retriever/llm/clients/litellm.py	Adds reasoning control support (reasoning_enabled defaults True, no-reasoning controls only applied when explicitly False), custom RAG system prompt, and per-call extra_params merging; correct opt-in behaviour for chat_template_kwargs.
nemo_retriever/helm/templates/configmap.yaml	Wires operator-managed answer LLM URL/model into retriever-service ConfigMap; api_key rendered as null (secret injected via env); nemo-retriever.nimOperator.url correctly gates on enabled flag.
nemo_retriever/helm/templates/nims/answer-llm.yaml	New NIMCache/NIMService template with pull-secret validation, NGC_API_KEY sourced from authSecret, and conditional rendering gated on CRD presence + nims.enabled + answer_llm.enabled.
nemo_retriever/helm/templates/deployment.yaml	Adds nemo-retriever.llmApiKeyEnv helper that injects NEMO_RETRIEVER_LLM_API_KEY from an explicit Secret or from the operator NIM authSecret.
nemo_retriever/src/nemo_retriever/service/config.py	Adds LLMConfig Pydantic model with extra=forbid, model_validator ensuring model is non-empty when enabled, and safe defaults.
nemo_retriever/tests/test_service_answer_generation.py	New test file covering LLMConfig validation, answer endpoint happy path, LLM-disabled 404, LLM 502 surfacing, and reasoning override.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Service as Retriever Service POST /v1/answer
    participant VectorDB as VectorDB Pod /v1/query
    participant LLM as LiteLLMClient app.state cached
    Client->>Service: POST /v1/answer query top_k reasoning_enabled
    Service->>Service: "Guard vectordb.enabled llm.enabled mode==gateway"
    Service->>VectorDB: POST /v1/query query top_k
    alt VectorDB error
        VectorDB-->>Service: error or non-200
        Service-->>Client: HTTP 502 or passthrough status
    end
    VectorDB-->>Service: results hits
    Service->>Service: extract chunks and metadata from hits
    Service->>Service: lazy-init LiteLLMClient cached in app.state
    Service->>LLM: generate query chunks reasoning_enabled
    Note over LLM: If reasoning_enabled=False prepend /no_think and chat_template_kwargs
    LLM-->>Service: GenerationResult answer model latency_s error
    alt gen.error set
        Service-->>Client: HTTP 502 LLM generation failed
    end
    Service-->>Client: AnswerResponse query answer model latency_s chunk_count chunks metadata

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
nemo_retriever/src/nemo_retriever/service/routers/ingest.py:1433-1434
`_text_from_hit` returns `""` when none of the four expected field names match. When the VectorDB returns hits that happen to not carry `text`, `content`, `chunk`, or `page_content` (e.g., an unexpected schema), `chunks` becomes `["", ...]` — a non-empty list that passes the truthiness check in `_build_rag_prompt` — so the LLM receives separator-only context rather than the `"(no context retrieved)"` fallback, silently degrading answer quality. Filtering empty strings out before the prompt is built avoids the silent degradation.

```suggestion
    chunks_and_meta = [(c, _metadata_from_hit(hit)) for hit in hits if (c := _text_from_hit(hit))]
    chunks = [c for c, _ in chunks_and_meta]
    metadata = [m for _, m in chunks_and_meta]
```

### Issue 2 of 2
nemo_retriever/src/nemo_retriever/service/routers/ingest.py:84-86
`query` has no maximum-length constraint. An arbitrarily large query string is forwarded verbatim to both the VectorDB embedding call and the LLM prompt, which can exhaust the LLM's context window or generate unexpectedly expensive requests. Adding a reasonable upper bound (e.g., 32 KB) is consistent with the `top_k` bounds already present and follows the input-validation-at-boundaries rule in the project's style guide.

```suggestion
class AnswerRequest(BaseModel):
    query: str = Field(min_length=1, max_length=32_768)
    top_k: int = Field(default=5, ge=1, le=1000)
```

_{Reviews (19): Last reviewed commit: "Merge branch 'main' into codex/super49b-..." | Re-trigger Greptile}

…-nim # Conflicts: # nemo_retriever/pyproject.toml # nemo_retriever/uv.lock

charlesbluca · 2026-06-05T15:41:21Z

Ignoring the VectorDB timeout nit; /v1/query shares this timeout quirk, so changes there are better suited as general tech debt versus in scope for this PR.

charlesbluca changed the title ~~[codex] Add Super-49B Helm answer generation~~ Add Super-49B Helm answer generation Jun 2, 2026

charlesbluca force-pushed the codex/super49b-helm-nim branch from 63857d9 to a643751 Compare June 2, 2026 18:34