various additional metrics#3470
Conversation
| api_name = self.api_name, | ||
| "DB transaction rollback", | ||
| ); | ||
| golem_common::metrics::db::record_db_rollback("sqlite", self.svc_name, self.api_name); |
There was a problem hiding this comment.
shouldn't we have this for postgres too?
There was a problem hiding this comment.
The clankers missed that one :D
| /// Runs the periodic pool-metrics reporting loop. | ||
| /// Records `db_pool_size` and `db_pool_idle` for the given service name every 15 seconds. | ||
| /// Never returns — intended to be run as a background task. | ||
| pub async fn run_metrics_loop(&self, svc_name: &'static str) -> anyhow::Result<()> { |
There was a problem hiding this comment.
Can't we set the gauge on each pool size change? Why do it with this background loop?
There was a problem hiding this comment.
I did not see anything in sqlx that allows reacting to pool size changes
| freed >= storage_bytes | ||
| } | ||
|
|
||
| /// Registers a background task in `join_set` that periodically records worker-status |
There was a problem hiding this comment.
I don't think we need a background polling task for this.
There is a single point where all status changes are recorded: update_cached_status in the WorkerService implementation. There we could directly change the gauges, couldn't we?
There was a problem hiding this comment.
I would still want to have this for the available memory / filesystem semaphore.
We could handle this inline on each acquire / release, but I feel that would impact readability of the code too much. Having it in the background in a loop (with 15 seconds update frequency currently) is not that much overhead while keeping things simpler.
But I can switch to inline metric updates, just wanted a confirmation before proceeding
There was a problem hiding this comment.
I'm not 100% against it :) just thought that if we can just set the gauages/counters where they change we don't have to think about whether this loop with the locks involved cause any perf issues or not, and also metrics would be "immediately" up-to-date for scrapers.
| } | ||
|
|
||
| pub fn merge(&mut self, mut other: Self) { | ||
| if let Some(other_permit) = other.permit.take() { |
There was a problem hiding this comment.
This is needed so drop of the other permit does not update the metrics
| .map_or(0, |permit| permit.num_permits()) | ||
| } | ||
|
|
||
| pub fn merge(&mut self, mut other: Self) { |
There was a problem hiding this comment.
same pattern here
|
|
||
| pub async fn remove(&self, agent_id: &AgentId) { | ||
| self.workers.remove(agent_id).await; | ||
| if let Some(worker) = self.workers.get(agent_id).await { |
There was a problem hiding this comment.
I think we should add a return value to the remove instead unless it's surely never get double called in some races (which I don't know)
There was a problem hiding this comment.
let me do it using RAII in the worker itself instead.
No description provided.