Add configurable answer LLM generation to service mode#2202
Add configurable answer LLM generation to service mode#2202charlesbluca wants to merge 18 commits into
Conversation
63857d9 to
a643751
Compare
This comment has been minimized.
This comment has been minimized.
1 similar comment
This comment has been minimized.
This comment has been minimized.
Greptile SummaryThis PR adds configurable answer-generation LLM support to the retriever service, wiring
|
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/service/routers/ingest.py | Adds POST /v1/answer endpoint with lazy-cached LiteLLMClient, VectorDB proxy, mode guard, and proper 502/404 error surfacing; minor concerns around empty-chunk passthrough and missing query length validation. |
| nemo_retriever/src/nemo_retriever/llm/clients/litellm.py | Adds reasoning control support (reasoning_enabled defaults True, no-reasoning controls only applied when explicitly False), custom RAG system prompt, and per-call extra_params merging; correct opt-in behaviour for chat_template_kwargs. |
| nemo_retriever/helm/templates/configmap.yaml | Wires operator-managed answer LLM URL/model into retriever-service ConfigMap; api_key rendered as null (secret injected via env); nemo-retriever.nimOperator.url correctly gates on enabled flag. |
| nemo_retriever/helm/templates/nims/answer-llm.yaml | New NIMCache/NIMService template with pull-secret validation, NGC_API_KEY sourced from authSecret, and conditional rendering gated on CRD presence + nims.enabled + answer_llm.enabled. |
| nemo_retriever/helm/templates/deployment.yaml | Adds nemo-retriever.llmApiKeyEnv helper that injects NEMO_RETRIEVER_LLM_API_KEY from an explicit Secret or from the operator NIM authSecret. |
| nemo_retriever/src/nemo_retriever/service/config.py | Adds LLMConfig Pydantic model with extra=forbid, model_validator ensuring model is non-empty when enabled, and safe defaults. |
| nemo_retriever/tests/test_service_answer_generation.py | New test file covering LLMConfig validation, answer endpoint happy path, LLM-disabled 404, LLM 502 surfacing, and reasoning override. |
Sequence Diagram
sequenceDiagram
participant Client
participant Service as Retriever Service POST /v1/answer
participant VectorDB as VectorDB Pod /v1/query
participant LLM as LiteLLMClient app.state cached
Client->>Service: POST /v1/answer query top_k reasoning_enabled
Service->>Service: "Guard vectordb.enabled llm.enabled mode==gateway"
Service->>VectorDB: POST /v1/query query top_k
alt VectorDB error
VectorDB-->>Service: error or non-200
Service-->>Client: HTTP 502 or passthrough status
end
VectorDB-->>Service: results hits
Service->>Service: extract chunks and metadata from hits
Service->>Service: lazy-init LiteLLMClient cached in app.state
Service->>LLM: generate query chunks reasoning_enabled
Note over LLM: If reasoning_enabled=False prepend /no_think and chat_template_kwargs
LLM-->>Service: GenerationResult answer model latency_s error
alt gen.error set
Service-->>Client: HTTP 502 LLM generation failed
end
Service-->>Client: AnswerResponse query answer model latency_s chunk_count chunks metadata
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
nemo_retriever/src/nemo_retriever/service/routers/ingest.py:1433-1434
`_text_from_hit` returns `""` when none of the four expected field names match. When the VectorDB returns hits that happen to not carry `text`, `content`, `chunk`, or `page_content` (e.g., an unexpected schema), `chunks` becomes `["", ...]` — a non-empty list that passes the truthiness check in `_build_rag_prompt` — so the LLM receives separator-only context rather than the `"(no context retrieved)"` fallback, silently degrading answer quality. Filtering empty strings out before the prompt is built avoids the silent degradation.
```suggestion
chunks_and_meta = [(c, _metadata_from_hit(hit)) for hit in hits if (c := _text_from_hit(hit))]
chunks = [c for c, _ in chunks_and_meta]
metadata = [m for _, m in chunks_and_meta]
```
### Issue 2 of 2
nemo_retriever/src/nemo_retriever/service/routers/ingest.py:84-86
`query` has no maximum-length constraint. An arbitrarily large query string is forwarded verbatim to both the VectorDB embedding call and the LLM prompt, which can exhaust the LLM's context window or generate unexpectedly expensive requests. Adding a reasonable upper bound (e.g., 32 KB) is consistent with the `top_k` bounds already present and follows the input-validation-at-boundaries rule in the project's style guide.
```suggestion
class AnswerRequest(BaseModel):
query: str = Field(min_length=1, max_length=32_768)
top_k: int = Field(default=5, ge=1, le=1000)
```
Reviews (19): Last reviewed commit: "Merge branch 'main' into codex/super49b-..." | Re-trigger Greptile
This comment has been minimized.
This comment has been minimized.
824dc2d to
a456393
Compare
This comment has been minimized.
This comment has been minimized.
a456393 to
85dcb49
Compare
This comment has been minimized.
This comment has been minimized.
…-nim # Conflicts: # nemo_retriever/pyproject.toml # nemo_retriever/uv.lock
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
Ignoring the VectorDB timeout nit; |
Description
Adds configurable answer-generation LLM support to the retriever service and Helm chart. The optional operator-managed
nimOperator.answer_llmblock defaults to the Super-49B NIM, but its image, model id, model profile, resources, service name, and env can be overridden for other compatible LLM NIMs such as Nemotron/Nano 3 30B. The same service can also point at an external OpenAI-compatible endpoint throughserviceConfig.llm.Changes include:
POST /v1/answerin the service: VectorDB retrieval followed by LiteLLM answer generation, with lazy client caching, 404/502 error handling, and independentinclude_chunks/include_metadataresponse flags.answer_llmNIMCache/NIMService rendering, including pull-secret validation, Super-49B defaults, Nano override coverage, and automatic in-clusterapi_base/modelwiring.api_key: null; the runtime API key is injected fromNEMO_RETRIEVER_LLM_API_KEYvia either explicitserviceConfig.llm.apiKeySecretor the operator NIM auth secret.reasoning_enabled=Falseapplies the shared no-reasoning request metadata/prompt controls used by the Super-49B and Nano NIMs, while generic library clients default to preserving provider behavior.litellm>=1.86.0,<2service dependency plus focused unit/Helm tests for service config, answer generation, reasoning, packaging, and model swaps.Validation:
uv run --project nemo_retriever pytest -q nemo_retriever/tests/test_llm_params.py nemo_retriever/tests/test_service_answer_generation.py nemo_retriever/tests/test_helm_answer_llm_generation.py nemo_retriever/tests/test_live_rag.py nemo_retriever/tests/test_service_pipeline_spec.py nemo_retriever/tests/test_service_packaging.pyuv run pre-commit run --all-filesChecklist