Skip to content

Add configurable answer LLM generation to service mode#2202

Open
charlesbluca wants to merge 18 commits into
NVIDIA:mainfrom
charlesbluca:codex/super49b-helm-nim
Open

Add configurable answer LLM generation to service mode#2202
charlesbluca wants to merge 18 commits into
NVIDIA:mainfrom
charlesbluca:codex/super49b-helm-nim

Conversation

@charlesbluca

@charlesbluca charlesbluca commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator

Description

Adds configurable answer-generation LLM support to the retriever service and Helm chart. The optional operator-managed nimOperator.answer_llm block defaults to the Super-49B NIM, but its image, model id, model profile, resources, service name, and env can be overridden for other compatible LLM NIMs such as Nemotron/Nano 3 30B. The same service can also point at an external OpenAI-compatible endpoint through serviceConfig.llm.

Changes include:

  • POST /v1/answer in the service: VectorDB retrieval followed by LiteLLM answer generation, with lazy client caching, 404/502 error handling, and independent include_chunks / include_metadata response flags.
  • Generic Helm answer_llm NIMCache/NIMService rendering, including pull-secret validation, Super-49B defaults, Nano override coverage, and automatic in-cluster api_base / model wiring.
  • Secret-safe LLM auth: ConfigMap renders api_key: null; the runtime API key is injected from NEMO_RETRIEVER_LLM_API_KEY via either explicit serviceConfig.llm.apiKeySecret or the operator NIM auth secret.
  • Unified reasoning control: reasoning_enabled=False applies the shared no-reasoning request metadata/prompt controls used by the Super-49B and Nano NIMs, while generic library clients default to preserving provider behavior.
  • Stable litellm>=1.86.0,<2 service dependency plus focused unit/Helm tests for service config, answer generation, reasoning, packaging, and model swaps.

Validation:

  • uv run --project nemo_retriever pytest -q nemo_retriever/tests/test_llm_params.py nemo_retriever/tests/test_service_answer_generation.py nemo_retriever/tests/test_helm_answer_llm_generation.py nemo_retriever/tests/test_live_rag.py nemo_retriever/tests/test_service_pipeline_spec.py nemo_retriever/tests/test_service_packaging.py
  • uv run pre-commit run --all-files
  • PR CI green: pre-commit, retriever unit tests, slim import contract, library-mode installs, Docker build/test, and PR validation.
  • Manual smoke covered Helm answer generation and standalone Docker comparison of Super-49B/Nano reasoning controls.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@charlesbluca charlesbluca changed the title [codex] Add Super-49B Helm answer generation Add Super-49B Helm answer generation Jun 2, 2026
@charlesbluca charlesbluca force-pushed the codex/super49b-helm-nim branch from 63857d9 to a643751 Compare June 2, 2026 18:34
@charlesbluca

This comment has been minimized.

1 similar comment
@charlesbluca

This comment has been minimized.

@greptile-apps

greptile-apps Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds configurable answer-generation LLM support to the retriever service, wiring POST /v1/answer (VectorDB retrieval + LiteLLM generation) through a new LLMConfig, a Helm answer_llm NIMCache/NIMService template, and secret-safe API key injection via NEMO_RETRIEVER_LLM_API_KEY.

  • POST /v1/answer in ingest.py: lazy-cached LiteLLMClient, VectorDB proxy with 502 surfacing on failure, 404 guards for disabled VectorDB/LLM and worker pods, and per-request reasoning_enabled override.
  • litellm.py reasoning controls: reasoning_enabled now defaults True (preserving provider behaviour); chat_template_kwargs and /no_think directives are only injected when explicitly set to False, fixing the previously flagged default-breaking-non-Nemotron issue.
  • Helm wiring: answer-llm.yaml NIMCache/NIMService, configmap.yaml auto-wires operator URL/model when nimOperator.answer_llm.enabled: true, deployment.yaml injects NEMO_RETRIEVER_LLM_API_KEY from a Secret reference; api_key: null is rendered in the ConfigMap, addressing the prior plaintext-key concern.

Confidence Score: 4/5

Safe to merge after addressing the empty-chunk passthrough and missing query length bound; the core flow (secret injection, operator URL gating, reasoning defaults) is correct.

The LLM API key is properly kept out of the ConfigMap and injected via Secret-backed env vars. The nemo-retriever.nimOperator.url helper correctly gates on enabled so no phantom URL is generated when the answer LLM is disabled. The reasoning_enabled default flip to True fixes the previously flagged provider-compatibility issue. Two issues remain: _text_from_hit can silently produce empty-string chunks that bypass the no-context fallback and mislead the LLM, and AnswerRequest.query carries no upper-length bound, leaving the VectorDB embedding call and LLM context window unguarded against abnormally large inputs.

nemo_retriever/src/nemo_retriever/service/routers/ingest.py — the _text_from_hit helper and AnswerRequest model definition.

Important Files Changed

Filename Overview
nemo_retriever/src/nemo_retriever/service/routers/ingest.py Adds POST /v1/answer endpoint with lazy-cached LiteLLMClient, VectorDB proxy, mode guard, and proper 502/404 error surfacing; minor concerns around empty-chunk passthrough and missing query length validation.
nemo_retriever/src/nemo_retriever/llm/clients/litellm.py Adds reasoning control support (reasoning_enabled defaults True, no-reasoning controls only applied when explicitly False), custom RAG system prompt, and per-call extra_params merging; correct opt-in behaviour for chat_template_kwargs.
nemo_retriever/helm/templates/configmap.yaml Wires operator-managed answer LLM URL/model into retriever-service ConfigMap; api_key rendered as null (secret injected via env); nemo-retriever.nimOperator.url correctly gates on enabled flag.
nemo_retriever/helm/templates/nims/answer-llm.yaml New NIMCache/NIMService template with pull-secret validation, NGC_API_KEY sourced from authSecret, and conditional rendering gated on CRD presence + nims.enabled + answer_llm.enabled.
nemo_retriever/helm/templates/deployment.yaml Adds nemo-retriever.llmApiKeyEnv helper that injects NEMO_RETRIEVER_LLM_API_KEY from an explicit Secret or from the operator NIM authSecret.
nemo_retriever/src/nemo_retriever/service/config.py Adds LLMConfig Pydantic model with extra=forbid, model_validator ensuring model is non-empty when enabled, and safe defaults.
nemo_retriever/tests/test_service_answer_generation.py New test file covering LLMConfig validation, answer endpoint happy path, LLM-disabled 404, LLM 502 surfacing, and reasoning override.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Service as Retriever Service POST /v1/answer
    participant VectorDB as VectorDB Pod /v1/query
    participant LLM as LiteLLMClient app.state cached
    Client->>Service: POST /v1/answer query top_k reasoning_enabled
    Service->>Service: "Guard vectordb.enabled llm.enabled mode==gateway"
    Service->>VectorDB: POST /v1/query query top_k
    alt VectorDB error
        VectorDB-->>Service: error or non-200
        Service-->>Client: HTTP 502 or passthrough status
    end
    VectorDB-->>Service: results hits
    Service->>Service: extract chunks and metadata from hits
    Service->>Service: lazy-init LiteLLMClient cached in app.state
    Service->>LLM: generate query chunks reasoning_enabled
    Note over LLM: If reasoning_enabled=False prepend /no_think and chat_template_kwargs
    LLM-->>Service: GenerationResult answer model latency_s error
    alt gen.error set
        Service-->>Client: HTTP 502 LLM generation failed
    end
    Service-->>Client: AnswerResponse query answer model latency_s chunk_count chunks metadata
Loading
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
nemo_retriever/src/nemo_retriever/service/routers/ingest.py:1433-1434
`_text_from_hit` returns `""` when none of the four expected field names match. When the VectorDB returns hits that happen to not carry `text`, `content`, `chunk`, or `page_content` (e.g., an unexpected schema), `chunks` becomes `["", ...]` — a non-empty list that passes the truthiness check in `_build_rag_prompt` — so the LLM receives separator-only context rather than the `"(no context retrieved)"` fallback, silently degrading answer quality. Filtering empty strings out before the prompt is built avoids the silent degradation.

```suggestion
    chunks_and_meta = [(c, _metadata_from_hit(hit)) for hit in hits if (c := _text_from_hit(hit))]
    chunks = [c for c, _ in chunks_and_meta]
    metadata = [m for _, m in chunks_and_meta]
```

### Issue 2 of 2
nemo_retriever/src/nemo_retriever/service/routers/ingest.py:84-86
`query` has no maximum-length constraint. An arbitrarily large query string is forwarded verbatim to both the VectorDB embedding call and the LLM prompt, which can exhaust the LLM's context window or generate unexpectedly expensive requests. Adding a reasonable upper bound (e.g., 32 KB) is consistent with the `top_k` bounds already present and follows the input-validation-at-boundaries rule in the project's style guide.

```suggestion
class AnswerRequest(BaseModel):
    query: str = Field(min_length=1, max_length=32_768)
    top_k: int = Field(default=5, ge=1, le=1000)
```

Reviews (19): Last reviewed commit: "Merge branch 'main' into codex/super49b-..." | Re-trigger Greptile

Comment thread nemo_retriever/src/nemo_retriever/retriever.py Outdated
Comment thread nemo_retriever/pyproject.toml Outdated
@greptile-apps

This comment has been minimized.

Comment thread nemo_retriever/src/nemo_retriever/service/routers/ingest.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/service/routers/ingest.py Outdated
Comment thread nemo_retriever/pyproject.toml Outdated
Comment thread nemo_retriever/tests/test_service_answer_generation.py
@charlesbluca charlesbluca force-pushed the codex/super49b-helm-nim branch 4 times, most recently from 824dc2d to a456393 Compare June 2, 2026 20:51
@charlesbluca

This comment has been minimized.

@charlesbluca charlesbluca force-pushed the codex/super49b-helm-nim branch from a456393 to 85dcb49 Compare June 2, 2026 21:13
@charlesbluca

This comment has been minimized.

@charlesbluca

This comment has been minimized.

@charlesbluca

This comment has been minimized.

@charlesbluca

This comment has been minimized.

@charlesbluca

This comment has been minimized.

@charlesbluca

This comment has been minimized.

Comment thread nemo_retriever/src/nemo_retriever/llm/clients/litellm.py
@charlesbluca

This comment has been minimized.

@charlesbluca

This comment has been minimized.

@charlesbluca

This comment has been minimized.

@charlesbluca

This comment has been minimized.

@charlesbluca charlesbluca changed the title Add Super-49B Helm answer generation Add configurable answer LLM generation Jun 3, 2026
@charlesbluca charlesbluca marked this pull request as ready for review June 4, 2026 13:51
@charlesbluca charlesbluca requested review from a team as code owners June 4, 2026 13:51
@charlesbluca charlesbluca requested a review from jperez999 June 4, 2026 13:51
@charlesbluca charlesbluca changed the title Add configurable answer LLM generation Add configurable answer LLM generation to service mode Jun 4, 2026
@charlesbluca

Copy link
Copy Markdown
Collaborator Author

Ignoring the VectorDB timeout nit; /v1/query shares this timeout quirk, so changes there are better suited as general tech debt versus in scope for this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant