For context, see #3188 (comment).
Background
The application-level metadata path after #2534 includes local MetadataInfo, revision calculation, metadata report publishing/loading, service-app mapping, and RPC MetadataService. When service discovery fails, it is currently difficult to identify which stage caused the failure. This issue only focuses on observability for the application-level metadata path.
Related code:
metadata/report_instance.go
metadata/mapping/metadata/service_name_mapping.go
metadata/client.go
registry/servicediscovery/service_instances_changed_listener_impl.go
Current Problems
The current observability is insufficient to diagnose application-level metadata failures.
Specific problems:
- Mapping register, get, listen, and remove operations do not have unified metrics or structured logs.
- Revision calculation and metadata cache hit/miss are hard to observe.
- Errors from metadata report loading, RPC metadata loading, URL construction, and mapping are not clearly categorized.
- When consumers cannot discover services, it is hard to tell whether the root cause is mapping, metadata report, RPC metadata service, revision, or cache.
Suggestions
- Add metrics and structured logs for mapping register, get, listen, and remove operations.
- Expose metadata source, storage type, revision, and cache hit/miss where appropriate.
- Use clear error categories for metadata report failure, RPC metadata failure, URL construction failure, revision mismatch, and mapping failure.
- Include useful context such as app, revision, service key, registry id, and storage type.
- Add failure-path tests that verify error messages or categories.