
test(benchmark): add latency bench for envector-msa-1.4.0 #99

Draft
heeyeon01 wants to merge 16 commits into main from benchmark/envector-latency-v1.4.0

Conversation

Contributor

@heeyeon01 heeyeon01 commented May 4, 2026

Rune × envector-msa-1.4.0 Latency Report

[Measurement environment] envector connections were measured with secure=false (plaintext gRPC); the Vault (localhost:50051) had TLS enabled.
Why plaintext was judged acceptable: gRPC reuses the HTTP/2 connection, so the TLS handshake cost does not appear in post-warmup measurements, and most of the processing time is spent in FHE computation, so TLS on/off was judged to have little effect on the results.
A follow-up benchmark with envector TLS enabled (secure=true) is planned for the same scenarios.

[2026-05-05 re-measurement note] The suspicion from the first run (2026-05-04) that vault_topk = 0 ms meant the FHE decryption branch was disabled has been resolved. With the eval key loaded correctly, these results cover the full end-to-end FHE path (novelty check + topk decrypt). Versus the first run, capture is ~50x and recall ~10x higher; the difference comes from the FHE decryption cost that the first run did not include.

Experiment plan and how to read the tables

Scenario definitions / measurement methodology

For the full scenarios (T1–T9), phase breakdown, statistical criteria, environment conditions, and validation method, see:
https://github.com/CryptoLabInc/rune/blob/benchmark/envector-latency-v1.4.0/benchmark/plans/latency_bench_plan_envector_v1.4.0.md

Statistics abbreviations

  • n: number of iterations actually measured, excluding warmup (the first 2 runs).
  • p50: 50th percentile of the sorted measurements. Median = the "typical case".
  • p95: 95th percentile. Time with the slowest 5% excluded = the "unlucky case".
  • We report p50/p95 instead of the mean because a single outlier can pull the mean up and distort the "typical speed".
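The convention above can be sketched as a small helper. This is hypothetical illustration code, not the actual benchmark runner; the sample latencies are made up to show the outlier effect:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the ceil(p/100 * n)-th smallest value."""
    ordered = sorted(samples)
    k = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[k - 1]

def summarize(latencies_ms, warmup=2):
    """Drop the first `warmup` runs, then report n / p50 / p95 as in the tables."""
    measured = latencies_ms[warmup:]
    return {
        "n": len(measured),
        "p50": percentile(measured, 50),
        "p95": percentile(measured, 95),
    }

# A single 5000 ms outlier barely moves p50 but would dominate the mean,
# which is why the report prefers p50/p95.
stats = summarize([900.0, 880.0, 100.0, 101.0, 102.0, 103.0, 104.0, 5000.0, 105.0, 106.0])
```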

Phase names

| Phase | What it does | Where |
| --- | --- | --- |
| embed | text → vector conversion (Qwen3-Embedding-0.6B) | local CPU |
| score | similarity-scoring request over encrypted vectors | envector cloud (FHE) |
| vault_topk | decrypt the encrypted scores returned by envector | Rune-Vault (local gRPC) |
| insert | encrypt new data and store it in envector | local encrypt + envector cloud |
| remind | metadata lookup during recall | local JSON |
| total | sum of the above (= user-perceived end-to-end time) | |

What vault_topk = 0 ms means: envector returned an empty encrypted-score list, so the Vault decryption branch was skipped. It did not truly take 0 seconds; the phase simply never ran.

Results summary

| Scenario | feature | p50 total | p95 total | n |
| --- | --- | --- | --- | --- |
| T1 Short English (~35 tokens) | capture | 5269.5 ms | 5449.8 ms | 8 |
| T2 Long English (~155 tokens) | capture | 4707.0 ms | 4931.9 ms | 8 |
| T3 Korean | capture | 4595.2 ms | 4789.1 ms | 8 |
| T4 Duplicate input (near-duplicate path) | capture | 4886.1 ms | 5020.2 ms | 8 |
| T5 Recall — exact match | recall | 368.2 ms | 385.3 ms | 8 |
| T6 Recall — Korean→English cross-language | recall | 384.1 ms | 464.3 ms | 8 |
| T7 Recall — topk 1/3/5/10 | recall | 367–376 ms | 374–402 ms | 5 each |
| T8 Batch per-item (size 1/5/10/20) | batch_capture | 238–245 ms | 242–246 ms | 3 each |
| T9 Vault health check | vault_status | 0.8 ms | 0.9 ms | 8 |

capture time breakdown: insert (FHE encrypt + remote insert, 4200–5000 ms) accounts for 91–95% of the total; embed/score/vault_topk together account for 5–9%.
recall time breakdown: score (180–200 ms) + vault_topk (100–115 ms) + remind (45–47 ms) together make up ~80% of the total; embedding takes 28–47 ms.
batch per-item: 238–245 ms (within ±3 ms across batch sizes 1, 5, 10, 20). batch_capture performs no insert and measures only embed + score (novelty).
recall is topk-independent: in T7, p50 is 367–376 ms for topk = 1, 3, 5, 10; the difference is < 3%.

For the detailed per-phase tables (p50, p95, p99, mean), see benchmark/reports/latency_results_v1.4.0_2026-05-05.md.


Notable findings

vault_topk / remind now measured correctly: the 0 ms issue from the first run is resolved

In the first report, vault_topk was 0.0 ms in every scenario; this time it measures 31–100 ms for capture and 100–115 ms for recall. With the eval key loading fixed, the FHE encrypted_blobs are no longer empty and the Vault gRPC decryption branch executes.

Incidental infrastructure fix: raised MAX_MESSAGE_LENGTH in mcp/adapter/vault_client.py from 256 MB to ~1.95 GB. The EvalKey response is ~1.18 GB and was failing with RESOURCE_EXHAUSTED under the 256 MB cap (separate fix commit d08e1bb).
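A minimal sketch of the size-limit change, assuming the standard gRPC channel-option names; the actual vault_client.py code may differ:

```python
# gRPC caps message sizes via channel options; INT32_MAX is its hard ceiling.
INT32_MAX = 2**31 - 1                    # 2,147,483,647 bytes
MAX_MESSAGE_LENGTH = 2000 * 1024 * 1024  # ~1.95 GiB, a round MB value under it

assert MAX_MESSAGE_LENGTH < INT32_MAX

# These option tuples would be passed as `options=` to grpc.insecure_channel /
# grpc.secure_channel. Both directions are raised, since the ~1.18 GB EvalKey
# response exceeded the 256 MB default cap.
GRPC_OPTIONS = [
    ("grpc.max_send_message_length", MAX_MESSAGE_LENGTH),
    ("grpc.max_receive_message_length", MAX_MESSAGE_LENGTH),
]
```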

Most of the capture time is spent in insert (~4.5 s, 92% of the total)

The insert phase accounts for 91–95% of total capture time. Internally it combines FHE encryption + the remote envector insert, and its absolute value is 4.2–5.0 s depending on the scenario. In the next measurement round, separately instrumenting:

  1. standalone FHE encrypt time (local CPU)
  2. remote envector insert_data gRPC time (network + server-side storage)

will make it much clearer where the time is spent.
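The split could be done with a small timing wrapper like this sketch; `client.encrypt` and `client.insert_data` are hypothetical stand-ins for the real SDK calls:

```python
import time

def timed_insert(client, vectors):
    """Time the two sub-phases of insert separately. `client.encrypt` and
    `client.insert_data` are assumed names, not the actual SDK API."""
    t0 = time.perf_counter()
    blobs = client.encrypt(vectors)   # 1. local CPU: FHE encryption
    t1 = time.perf_counter()
    client.insert_data(blobs)         # 2. network + server-side storage
    t2 = time.perf_counter()
    return {
        "encrypt_ms": (t1 - t0) * 1000.0,
        "insert_rpc_ms": (t2 - t1) * 1000.0,
    }
```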

Note: the PPMM cache optimization in envector-msa-1.4.0 (docs/design/ppmm-cache-optimization-analysis.md) applies only to the Search path and is unrelated to insert. It also has no effect in the single-query (m=1) scenario.

The recall score phase takes about 2x the capture novelty score

It is the same score call, but 100 ms during capture vs 180 ms during recall. Presumably because capture only performs the novelty check and receives a simple distance metric, while recall uses the result to drive the subsequent vault_topk + remind branches and the ScoreEntry payload differs. The code path needs verification.

T3 Korean capture spends extra time in score/vault_topk versus English

| | embed | score | vault_topk | insert |
| --- | --- | --- | --- | --- |
| T1 English | 92.6 | 100.4 | 31.9 | 5018.1 |
| T3 Korean | 126.9 (+37%) | 167.5 (+67%) | 75.6 (+137%) | 4209.0 (-16%) |

embed/score/vault_topk take longer for Korean than for English, while insert is shorter. The embed difference can be explained by tokenizer subword differences, but it is unclear why score/vault_topk would be affected by token count (in theory the FHE vector length should be identical). A difference in tokenizer output vector dim, or in payload length showing up as serialization overhead, is a possibility.

T4 duplicate-input score is about 2x the other capture scenarios

T1 score 100 ms, T4 score 211 ms. T4 captures the same text twice, so presumably the novelty result comes back high-similarity and the vault_topk decrypt branch executes a different path. The p95 spike in the first run may have the same cause.

batch_capture per-item is about 1/20 of a single capture

Per-item 240 ms vs single capture 5000 ms. Most of the difference is because batch_capture performs no insert and measures only embed + score (novelty). If batch-insert-inclusive timing is needed, the runner needs a separate scenario.

heeyeon01 and others added 4 commits May 4, 2026 16:42
…ector 1.4

- Add optional `secure` field to EnVectorConfig, EnVectorClient, SDK adapter,
  and MCP server; reads from config.json and ENVECTOR_SECURE env var
- Normalize query_encryption bool → string ("plain"/"cipher") to match
  pyenvector 1.4 API

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Switches Claude Code MCP registration from relying on plugin.json
mcpServers auto-discovery to explicit `claude mcp add --scope user`
invocation, which is more reliable and avoids path-conflict issues
with the plugin system.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When ENVECTOR_ENDPOINT or ENVECTOR_API_KEY are not set via environment
variables, read them from the envector section of ~/.rune/config.json,
consistent with how other config values (secure, vault) are loaded.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
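The fallback order this commit describes might look like the following sketch. The path (~/.rune/config.json) and env-var convention come from the commit message; the function name and error handling are assumptions:

```python
import json
import os
from pathlib import Path

def load_envector_setting(name, env_var,
                          config_path=Path.home() / ".rune" / "config.json"):
    """Env var wins; otherwise fall back to the `envector` section of
    ~/.rune/config.json, consistent with how secure/vault are loaded."""
    value = os.environ.get(env_var)
    if value is not None:
        return value
    try:
        config = json.loads(config_path.read_text())
    except (OSError, json.JSONDecodeError):
        return None
    return config.get("envector", {}).get(name)

# e.g. load_envector_setting("endpoint", "ENVECTOR_ENDPOINT")
```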
Measures per-phase latency (embed → FHE score → vault_topk → insert/recall)
for capture, recall, batch_capture, and vault_status against live envector-msa-1.4.0.
Includes plan doc, v1.4.0 results report, and README updates.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@heeyeon01 changed the title from "test(benchmark): add latency bench for envector-msa-1.4.0" to "진행중 (in progress) - test(benchmark): add latency bench for envector-msa-1.4.0" on May 4, 2026
heeyeon01 and others added 4 commits May 5, 2026 20:32
- agents/common/envector_client.py: read ENVECTOR_EVAL_MODE env var
  instead of hardcoding "rmp"
- mcp/server/server.py: switch default eval_mode from "rmp" to "mm"
  in both runtime adapter init and CLI flag

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… EvalKey

EvalKey in pyenvector >= 1.4.0 reaches ~1.2 GB, exceeding the previous
256 MB cap. Set the limit to 2000 MB — the largest round value that stays
under INT32_MAX (gRPC's hard ceiling).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-measured on the end-to-end FHE path (after eval key load, with the
vault_topk branch active). The ~x50 capture / ~x10 recall increase versus
yesterday reflects the FHE decryption cost finally being included, which
had gone unmeasured until now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The actual version in the install/measurement environment is stable 1.4.0
(not pre-release a5). Corrected identically in both the plan and the 2026-05-05 report.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@heeyeon01 changed the title from "진행중 (in progress) - test(benchmark): add latency bench for envector-msa-1.4.0" to "test(benchmark): add latency bench for envector-msa-1.4.0" on May 5, 2026
heeyeon01 and others added 2 commits May 5, 2026 21:02
Document the exact source and scope of two changes the plan described only abstractly:
- `secure` parameter: SDK Configuration documentation.
- PPMM cache optimization: 1.20–1.40x (V2) / 1.35–1.59x (V3) at query batch m=4–16;
  no effect for single queries (m=1). Search path only.

Removed the PPMM citation from the report's capture insert analysis. PPMM is a
Search-path optimization, confirmed against the envector design doc to be unrelated
to the insert/encrypt path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t section

Keep the data (~92%) as stated fact, but adjust the heading wording to be neutral.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@heeyeon01 heeyeon01 requested a review from sunchuljung May 5, 2026 12:16
@heeyeon01 heeyeon01 self-assigned this May 6, 2026
@heeyeon01 heeyeon01 requested a review from a team May 6, 2026 00:28
heeyeon01 and others added 6 commits May 6, 2026 13:59
GetPublicKey streams ~1.1 GiB EvalKey for pyenvector 1.4 and exceeds
the shared 30s default deadline (measured ~38s on a 30 MiB/s link).
Companion to d08e1bb, which raised the gRPC message size limit but
left the deadline untouched. Standard RPC timeout stays at 30s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
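In gRPC's Python API a per-call deadline is passed as `timeout=`, which is one way the pattern this commit describes could look. The stub shape below is an assumption; the 90 s value is the GetPublicKey deadline a later commit (654806e) refers to:

```python
EVAL_KEY_FETCH_TIMEOUT_S = 90  # large EvalKey stream: ~38 s measured on a 30 MiB/s link
DEFAULT_RPC_TIMEOUT_S = 30     # every other RPC keeps the shared default

def fetch_eval_key(stub, request):
    # A per-call `timeout=` overrides the shared default deadline just for
    # this one streaming call; standard RPCs are left at 30 s.
    return stub.GetPublicKey(request, timeout=EVAL_KEY_FETCH_TIMEOUT_S)
```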
Apply Prettier-style cleanup to all .md files modified vs main on this
branch: pad table cells to equal visual width (CJK-aware), insert blank
lines before fenced code blocks and tables, and drop trailing two-space
hard breaks where the next line is already a block boundary. Intentional
hard breaks in paragraph-fragment lists are preserved. No content changes.
Plumb the new envector_secure field through fetch_keys_from_vault
(now an 8-tuple) and apply it to EnVectorConfig.secure in both the
reload (_init_pipelines) and startup-init paths. When Vault provides
the field, the value is persisted to ~/.rune/config.json so the
EnVectorClient/Adapter is constructed with the right TLS mode on
subsequent runs.

Backward-compatible: older Vault servers that omit the field cause
vault_ev_secure to come back as None, in which case the client leaves
the existing secure setting (env var → config → SDK default) untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
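The backward-compatibility rule reduces to a None check; a sketch with hypothetical names, where `vault_ev_secure` is the value (or absence) reported by Vault:

```python
def resolve_secure(vault_ev_secure, existing_secure):
    """Older Vault servers omit the field, so it arrives as None: keep the
    existing env var -> config -> SDK-default resolution untouched.
    Otherwise honor the value Vault provided (it is then persisted to
    ~/.rune/config.json for subsequent runs)."""
    if vault_ev_secure is None:
        return existing_secure
    return vault_ev_secure
```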
…lict

Claude Code auto-registers enVector MCP from the plugin manifest
(plugin:rune:envector). The duplicate user-scope `claude mcp add`
entry competed with the plugin-scope registration during stdio
handshake and caused both to fail. Keep the legacy-entry purge so
older installs are cleaned up; let the plugin manifest own
registration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The synchronous fetch_keys_from_vault() call in main() blocked MCP
transport startup on the GetPublicKey gRPC stream. With pyenvector
1.4.0's larger EvalKey the stream now regularly exceeds Claude Code's
30s MCP startup limit, leaving the server unreachable on cold starts.
The existing _init_pipelines background thread already re-fetches and
rebuilds the adapter, so the sync block was redundant — remove it and
let pipeline init populate key_id / agent_id / agent_dek / index_name
once Vault responds. Tools that need these already gate via
_ensure_pipelines() so they wait gracefully.

Mirror the envector_not_provisioned dormant transition into
_init_pipelines so the missing-credentials signal is preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The thread wrapper around fetch_keys_from_vault used a 30s timeout while
the underlying GetPublicKey gRPC call has a 90s deadline (654806e). On
cold start the gRPC call could take 40s+ due to TLS handshake racing
sentence-transformers model loading on MainThread, so the wrapper
declared failure and switched the server to dormant even though the
worker thread eventually succeeded and wrote keys to disk.

Bump the wrapper timeout to 120s (gRPC deadline + buffer for thread/
asyncio/TLS overhead) with a 30s grace re-poll so an in-flight call
returns its real result. The `with ThreadPoolExecutor` block already
waits on shutdown, so a longer timeout costs nothing on the success
path.
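The timeout-plus-grace-re-poll wrapper described above might be sketched as follows (function name and structure are assumptions, not the actual server code):

```python
import concurrent.futures

WRAPPER_TIMEOUT_S = 120  # gRPC deadline (90 s) + buffer for thread/asyncio/TLS overhead
GRACE_REPOLL_S = 30      # one more poll so an in-flight call can return its real result

def fetch_keys_with_timeout(fetch_fn):
    """Run fetch_fn in a worker thread; on timeout, re-poll once instead of
    immediately declaring failure, so a slow-but-successful fetch that has
    already written keys to disk is not discarded."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch_fn)
        try:
            return future.result(timeout=WRAPPER_TIMEOUT_S)
        except concurrent.futures.TimeoutError:
            try:
                return future.result(timeout=GRACE_REPOLL_S)
            except concurrent.futures.TimeoutError:
                return None  # caller transitions the server to dormant
```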

Also rework the ad-hoc DEBUG file logger added during diagnosis into
an opt-in RotatingFileHandler controlled by RUNE_MCP_DEBUG_LOG (=1 for
INFO, =debug for verbose). Off by default so normal runs don't write
to disk; preserved as a tool for future background-thread debugging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
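An opt-in rotating debug logger along the lines described could look like this sketch; only the RUNE_MCP_DEBUG_LOG convention (=1 for INFO, =debug for verbose, off by default) comes from the commit message, the rest (function name, file path, rotation sizes) is assumed:

```python
import logging
import os
from logging.handlers import RotatingFileHandler

def maybe_attach_debug_log(logger, path="rune_mcp_debug.log"):
    """Off by default so normal runs write nothing to disk.
    RUNE_MCP_DEBUG_LOG=1 -> INFO, RUNE_MCP_DEBUG_LOG=debug -> DEBUG."""
    mode = os.environ.get("RUNE_MCP_DEBUG_LOG", "")
    if not mode:
        return None  # opt-in only
    # delay=True defers opening the file until the first record is emitted
    handler = RotatingFileHandler(path, maxBytes=5 * 1024 * 1024,
                                  backupCount=3, delay=True)
    handler.setLevel(logging.DEBUG if mode.lower() == "debug" else logging.INFO)
    logger.addHandler(handler)
    return handler
```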
Contributor

@esifea esifea left a comment


In my opinion, we don't need to update the Python implementation, which will be deprecated after v0.4.0.

Could you please just keep the benchmark scripts that will also be reusable with the Go implementation?

@heeyeon01
Contributor Author

heeyeon01 commented May 7, 2026

> In my opinion, we don't need to update the Python implementation, which will be deprecated after v0.4.0.
>
> Could you please just keep the benchmark scripts that will also be reusable with the Go implementation?

I see your point. I kept the changes because this branch also records which source code Rune was based on during the benchmark measurements. The modified code is intended for integration with enVector 1.4.0 (mm mode).
I'll separate and clean up the branch later. Since this is still in the testing phase, I'd prefer to keep it as is for now; that's why it's still a draft.

@heeyeon01 heeyeon01 requested a review from esifea May 7, 2026 09:34
@couragehong couragehong added the DO NOT MERGE label May 12, 2026
