feat: basic egress tracking prometheus metrics#6
Conversation
alanshaw
left a comment
There was a problem hiding this comment.
I've left comments on parts of the code I don't think were meant to be in this PR...
There was a problem hiding this comment.
Suggestion: trigger this when you get a request, unless it is already running.
There was a problem hiding this comment.
you mean not starting the consolidator (https://github.com/storacha/etracker/pull/6/files#diff-24d24419fa40f605bd077ae0a3b7e2bb7f703deb66475aa1a040798bee9bfb1aR138) until we get the first request?
There was a problem hiding this comment.
No I mean instead of running on an interval - just just a request to trigger the run.
There was a problem hiding this comment.
so consolidate on every track request? I think processing them in batches might be slightly more efficient?
There was a problem hiding this comment.
Something more like... "ignoring unsuccessful retrieval: %s", receipt.Ran()
There was a problem hiding this comment.
the error happens when trying to rebind the receipt, so I think it means the receipt is not of the expected type, no?
It being a retrieval receipt, but an unsuccessful one, is a different case that I'll check in validateReceipt.
There was a problem hiding this comment.
This should be issuing a receipt with the errors no?
There was a problem hiding this comment.
true, this is not an error processing a single receipt but an error that prevents the consolidation of the full batch
There was a problem hiding this comment.
Personally I'd register a handler and call it - then you don't need to do things like issue a receipt, you can just return the ok/error type from the handler and store the returned receipt.
Note: it doesn't need to go over the network.
Right now everything looks a little weird because you're kinda manually executing the invocation.
There was a problem hiding this comment.
It's a lot of data to buffer in memory - basically the size of a batch. Why not return a iter.Seq?
There was a problem hiding this comment.
🤦🏻 I always forget about iterators. Good catch, thanks.
| } | ||
|
|
||
| // TODO: do more validation here. | ||
| // At the very least that the invocation is a retrieval invocation and the audience is the node |
There was a problem hiding this comment.
You should invoke the validator, which will check the delegation chain.
There was a problem hiding this comment.
wait, how do I do that?
There was a problem hiding this comment.
There was a problem hiding this comment.
List?
| List(ctx context.Context, nodeDID did.DID) ([]ConsolidatedRecord, error) |
There was a problem hiding this comment.
Minor, and partly because I think Decode works with strings also, but we usually parse strings and decode bytes.
There was a problem hiding this comment.
yeah, Decode accepts a string (it wouldn't compile otherwise) and Parse essentially calls Decode when the param is a string, but I'll update
| return err | ||
| } | ||
|
|
||
| metrics.TrackedBatchesPerNode.WithLabelValues(nodeDID.String()).Inc() |
There was a problem hiding this comment.
I think you need to increment in the DB also and when you start the service, read it from the DB and set the counter value. Otherwise when you restart the service your counts will be off.
There was a problem hiding this comment.
I don't think that's necessary. Prometheus handles resets of the counter in functions like rate or increase. In a sense, the actual absolute value of the counter doesn't really matter.
sorry, I forgot to rebase the branch onto main after merging the branch this PR was targeting. The good thing about this is that I got your comments on #5 🙂. I'll take care of them in a follow up. |
| @@ -0,0 +1,26 @@ | |||
| package metrics | |||
There was a problem hiding this comment.
Any reason this can't use the opentelemetry go sdk?
There was a problem hiding this comment.
no, I just wanted something quick and this is how I set up prometheus metrics endpoints in the past. As opposed to piri's case, this is a service we own so I can add secrets, and the endpoint is going to be scrapped by Grafana, so I just went the default route.
There was a problem hiding this comment.
I've been thinking (hoping even) we could move toward standardizing on OpenTelemetry across services - indexing-service, Piri, and (soon) guppy all use it.
The Go SDK has a native Prometheus exporter that works nearly the same way, so I don't see it adding much more complexity.
That said, totally get wanting to move fast here. Happy to help with OTel if you're interested, but no worries either way
There was a problem hiding this comment.
take a look at 531320f and let me know if it matches what you had in mind
There was a problem hiding this comment.
ConsolidatedBytesPerNode could probably be an integer counter, will leave it up to you. Looks good otherwise - thanks!
There was a problem hiding this comment.
ah, good catch, thank you!
| // Increment consolidated bytes counter for this node | ||
| metrics.ConsolidatedBytesPerNode.WithLabelValues(record.NodeID.String()).Add(float64(totalBytes)) |
There was a problem hiding this comment.
nit: probably cleaner to handle the recording keeping of this metric within the DB method. i.e. within consolidatedTable.Add
There was a problem hiding this comment.
I disagree with this, because the metric is not related with how the data is stored IMO. If we had more than one implementation of ConsolidatedTable we wouldn't want to have all and each of them to publish the metric.
| return err | ||
| } | ||
|
|
||
| metrics.TrackedBatchesPerNode.WithLabelValues(nodeDID.String()).Inc() |
There was a problem hiding this comment.
Similar to my comment elsewhere, could do this here: https://github.com/storacha/etracker/blob/main/internal/db/egress/dynamodb.go#L70
There was a problem hiding this comment.
similar comment, similar reply 😛
|
now that feedback not related with the metrics endpoint is been addressed in #10 I'm rebasing this on main to focus on the metrics changes. Sorry for the hassle. |
cad0d3c to
7ce4282
Compare
| startCmd.Flags().String( | ||
| "grafana-metrics-token", | ||
| "", | ||
| "Grafana metrics token", | ||
| ) | ||
| cobra.CheckErr(viper.BindPFlag("grafana_metrics_token", startCmd.Flags().Lookup("grafana-metrics-token"))) | ||
|
|
There was a problem hiding this comment.
nit: Would prefer we use (add) an env var here. Or maybe we already have an env var for this, in which case disregard my comment; CLI args are visible in ps output and shell history, Env vars aren't perfect but they're significantly less exposed.
There was a problem hiding this comment.
it's an env var, storoku automatically adds env vars to make secrets available to the service. I created a flag with the same name so that the env var is picked up automatically. Is there an option to make viper read the env var without exposing a CLI arg?
There was a problem hiding this comment.
Yeah, https://pkg.go.dev/github.com/spf13/viper#BindEnv like you are using below. This isn't a blocking comment, just a suggestion for future changes.
There was a problem hiding this comment.
oh, I thought the existence of a flag was a pre-requisite for BindEnv to work. So I can remove the startCmd.Flags().String("metrics-auth-token", ...) definition and just issue a viper.BindEnv("metrics_auth_token")? Nice.
There was a problem hiding this comment.
roughly yes, it depends on how viper is configured in this project.
Based on https://github.com/storacha/etracker/blob/main/cmd/etracker/main.go#L45 that would bind the metrics_auth_token viper key to ETRACKER_METRICS_AUTH_TOKEN
| // Increment consolidated bytes counter for this node | ||
| metrics.ConsolidatedBytesPerNode.Add(ctx, int64(totalBytes), metric.WithAttributes(attribute.String("node_id", record.NodeID.String()))) | ||
|
|
There was a problem hiding this comment.
Should this happen after we issue the receipt? Ensuring we only count a metric if all operations in the loop are successful?
There was a problem hiding this comment.
yes, and it will. After the changes in #10 issuing the receipt manually is not necessary anymore.
There was a problem hiding this comment.
No I mean instead of running on an interval - just just a request to trigger the run.
| TF_VAR_private_key= # private_key or your env -- do not commit to repo! | ||
| TF_VAR_allowed_account_id=505595374361 | ||
| TF_VAR_region=us-west-2 | ||
| TF_VAR_grafana_metrics_token= # enter a value for GRAFANA_METRICS_TOKEN secret |
There was a problem hiding this comment.
🤷♂️ the endpoint returns metrics in prometheus format - the fact that we use grafana to scrape/graph it is implementation detail. I would call this something more like METRICS_AUTH_TOKEN.
| } | ||
|
|
||
| // TODO: do more validation here. | ||
| // At the very least that the invocation is a retrieval invocation and the audience is the node |
There was a problem hiding this comment.
No description provided.