Skip to content

feat: basic egress tracking prometheus metrics#6

Merged
volmedo merged 9 commits into
mainfrom
vic/feat/basic-metrics
Oct 8, 2025
Merged

feat: basic egress tracking prometheus metrics#6
volmedo merged 9 commits into
mainfrom
vic/feat/basic-metrics

Conversation

@volmedo

@volmedo volmedo commented Oct 6, 2025

Copy link
Copy Markdown
Contributor

No description provided.

@volmedo volmedo self-assigned this Oct 6, 2025
@volmedo volmedo changed the base branch from main to vic/feat/consolidator-mvp October 6, 2025 08:25
Base automatically changed from vic/feat/consolidator-mvp to main October 6, 2025 09:41

@alanshaw alanshaw left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left comments on parts of the code I don't think were meant to be in this PR...

Comment thread cmd/etracker/start.go

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: trigger this when you get a request, unless it is already running.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I mean instead of running on an interval - just just a request to trigger the run.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so consolidate on every track request? I think processing them in batches might be slightly more efficient?

Comment thread internal/consolidator/consolidator.go Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something more like... "ignoring unsuccessful retrieval: %s", receipt.Ran()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the error happens when trying to rebind the receipt, so I think it means the receipt is not of the expected type, no?

It being a retrieval receipt, but an unsuccessful one, is a different case that I'll check in validateReceipt.

Comment thread internal/consolidator/consolidator.go Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be issuing a receipt with the errors no?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, this is not an error processing a single receipt but an error that prevents the consolidation of the full batch

Comment thread internal/consolidator/consolidator.go Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I'd register a handler and call it - then you don't need to do things like issue a receipt, you can just return the ok/error type from the handler and store the returned receipt.

Note: it doesn't need to go over the network.

Right now everything looks a little weird because you're kinda manually executing the invocation.

Comment thread internal/consolidator/consolidator.go Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a lot of data to buffer in memory - basically the size of a batch. Why not return a iter.Seq?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦🏻 I always forget about iterators. Good catch, thanks.

}

// TODO: do more validation here.
// At the very least that the invocation is a retrieval invocation and the audience is the node

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should invoke the validator, which will check the delegation chain.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait, how do I do that?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

captured for posterity here #12

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

List?

Suggested change
List(ctx context.Context, nodeDID did.DID) ([]ConsolidatedRecord, error)

Comment thread internal/server/handlers.go Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, and partly because I think Decode works with strings also, but we usually parse strings and decode bytes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, Decode accepts a string (it wouldn't compile otherwise) and Parse essentially calls Decode when the param is a string, but I'll update

Comment thread internal/service/service.go Outdated
return err
}

metrics.TrackedBatchesPerNode.WithLabelValues(nodeDID.String()).Inc()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to increment in the DB also and when you start the service, read it from the DB and set the counter value. Otherwise when you restart the service your counts will be off.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's necessary. Prometheus handles resets of the counter in functions like rate or increase. In a sense, the actual absolute value of the counter doesn't really matter.

@volmedo

volmedo commented Oct 6, 2025

Copy link
Copy Markdown
Contributor Author

I've left comments on parts of the code I don't think were meant to be in this PR...

sorry, I forgot to rebase the branch onto main after merging the branch this PR was targeting. The good thing about this is that I got your comments on #5 🙂. I'll take care of them in a follow up.

@@ -0,0 +1,26 @@
package metrics

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason this can't use the opentelemetry go sdk?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, I just wanted something quick and this is how I set up prometheus metrics endpoints in the past. As opposed to piri's case, this is a service we own so I can add secrets, and the endpoint is going to be scrapped by Grafana, so I just went the default route.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking (hoping even) we could move toward standardizing on OpenTelemetry across services - indexing-service, Piri, and (soon) guppy all use it.
The Go SDK has a native Prometheus exporter that works nearly the same way, so I don't see it adding much more complexity.

That said, totally get wanting to move fast here. Happy to help with OTel if you're interested, but no worries either way

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

take a look at 531320f and let me know if it matches what you had in mind

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ConsolidatedBytesPerNode could probably be an integer counter, will leave it up to you. Looks good otherwise - thanks!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, good catch, thank you!

Comment thread internal/consolidator/consolidator.go Outdated
Comment on lines +147 to +148
// Increment consolidated bytes counter for this node
metrics.ConsolidatedBytesPerNode.WithLabelValues(record.NodeID.String()).Add(float64(totalBytes))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: probably cleaner to handle the recording keeping of this metric within the DB method. i.e. within consolidatedTable.Add

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree with this, because the metric is not related with how the data is stored IMO. If we had more than one implementation of ConsolidatedTable we wouldn't want to have all and each of them to publish the metric.

Comment thread internal/service/service.go Outdated
return err
}

metrics.TrackedBatchesPerNode.WithLabelValues(nodeDID.String()).Inc()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to my comment elsewhere, could do this here: https://github.com/storacha/etracker/blob/main/internal/db/egress/dynamodb.go#L70

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar comment, similar reply 😛

@volmedo

volmedo commented Oct 7, 2025

Copy link
Copy Markdown
Contributor Author

now that feedback not related with the metrics endpoint is been addressed in #10 I'm rebasing this on main to focus on the metrics changes. Sorry for the hassle.

@volmedo volmedo force-pushed the vic/feat/basic-metrics branch from cad0d3c to 7ce4282 Compare October 7, 2025 15:46
Comment thread cmd/etracker/start.go Outdated
Comment on lines +55 to +61
startCmd.Flags().String(
"grafana-metrics-token",
"",
"Grafana metrics token",
)
cobra.CheckErr(viper.BindPFlag("grafana_metrics_token", startCmd.Flags().Lookup("grafana-metrics-token")))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Would prefer we use (add) an env var here. Or maybe we already have an env var for this, in which case disregard my comment; CLI args are visible in ps output and shell history, Env vars aren't perfect but they're significantly less exposed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's an env var, storoku automatically adds env vars to make secrets available to the service. I created a flag with the same name so that the env var is picked up automatically. Is there an option to make viper read the env var without exposing a CLI arg?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, https://pkg.go.dev/github.com/spf13/viper#BindEnv like you are using below. This isn't a blocking comment, just a suggestion for future changes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I thought the existence of a flag was a pre-requisite for BindEnv to work. So I can remove the startCmd.Flags().String("metrics-auth-token", ...) definition and just issue a viper.BindEnv("metrics_auth_token")? Nice.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

roughly yes, it depends on how viper is configured in this project.

Based on https://github.com/storacha/etracker/blob/main/cmd/etracker/main.go#L45 that would bind the metrics_auth_token viper key to ETRACKER_METRICS_AUTH_TOKEN

Comment thread internal/consolidator/consolidator.go Outdated
Comment on lines +149 to +151
// Increment consolidated bytes counter for this node
metrics.ConsolidatedBytesPerNode.Add(ctx, int64(totalBytes), metric.WithAttributes(attribute.String("node_id", record.NodeID.String())))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this happen after we issue the receipt? Ensuring we only count a metric if all operations in the loop are successful?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, and it will. After the changes in #10 issuing the receipt manually is not necessary anymore.

Comment thread cmd/etracker/start.go

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I mean instead of running on an interval - just just a request to trigger the run.

Comment thread deploy/.env.terraform.tpl Outdated
TF_VAR_private_key= # private_key or your env -- do not commit to repo!
TF_VAR_allowed_account_id=505595374361
TF_VAR_region=us-west-2
TF_VAR_grafana_metrics_token= # enter a value for GRAFANA_METRICS_TOKEN secret

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷‍♂️ the endpoint returns metrics in prometheus format - the fact that we use grafana to scrape/graph it is implementation detail. I would call this something more like METRICS_AUTH_TOKEN.

}

// TODO: do more validation here.
// At the very least that the invocation is a retrieval invocation and the audience is the node

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@volmedo volmedo merged commit 827cb4d into main Oct 8, 2025
4 of 5 checks passed
@volmedo volmedo deleted the vic/feat/basic-metrics branch October 8, 2025 11:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants