Skip to content

various additional metrics#3470

Merged
mschuwalow merged 8 commits into
mainfrom
better-metrics
May 19, 2026
Merged

various additional metrics#3470
mschuwalow merged 8 commits into
mainfrom
better-metrics

Conversation

@mschuwalow
Copy link
Copy Markdown
Contributor

No description provided.

@mschuwalow mschuwalow self-assigned this May 18, 2026
Comment thread golem-service-base/src/db/sqlite.rs Outdated
api_name = self.api_name,
"DB transaction rollback",
);
golem_common::metrics::db::record_db_rollback("sqlite", self.svc_name, self.api_name);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we have this for postgres too?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The clankers missed that one :D

/// Runs the periodic pool-metrics reporting loop.
/// Records `db_pool_size` and `db_pool_idle` for the given service name every 15 seconds.
/// Never returns — intended to be run as a background task.
pub async fn run_metrics_loop(&self, svc_name: &'static str) -> anyhow::Result<()> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we set the gauge on each pool size change? Why do it with this background loop?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not see anything in sqlx that allows reacting to pool size changes

freed >= storage_bytes
}

/// Registers a background task in `join_set` that periodically records worker-status
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need a background polling task for this.
There is a single point where all status changes are recorded: update_cached_status in the WorkerService implementation. There we could directly change the gauges, couldn't we?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would still want to have this for the available memory / filesystem semaphore.

We could handle this inline on each acquire / release, but I feel that would impact readability of the code too much. Having it in the background in a loop (with 15 seconds update frequency currently) is not that much overhead while keeping things simpler.

But I can switch to inline metric updates, just wanted a confirmation before proceeding

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% against it :) just thought that if we can just set the gauages/counters where they change we don't have to think about whether this loop with the locks involved cause any perf issues or not, and also metrics would be "immediately" up-to-date for scrapers.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, done 652960d

}

pub fn merge(&mut self, mut other: Self) {
if let Some(other_permit) = other.permit.take() {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed so drop of the other permit does not update the metrics

.map_or(0, |permit| permit.num_permits())
}

pub fn merge(&mut self, mut other: Self) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same pattern here


pub async fn remove(&self, agent_id: &AgentId) {
self.workers.remove(agent_id).await;
if let Some(worker) = self.workers.get(agent_id).await {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a return value to the remove instead unless it's surely never get double called in some races (which I don't know)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me do it using RAII in the worker itself instead.

@mschuwalow mschuwalow enabled auto-merge (squash) May 19, 2026 16:02
@mschuwalow mschuwalow merged commit ae191ca into main May 19, 2026
116 of 122 checks passed
@mschuwalow mschuwalow deleted the better-metrics branch May 19, 2026 18:34
@github-actions github-actions Bot locked and limited conversation to collaborators May 19, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants