Description
We want to improve observability around the prebuild reconciliation process by adding metrics derived from the `ReconciliationState` and `ReconciliationActions`.
The metrics should capture key aspects of the reconciliation, such as:
- The current state of prebuilds per preset (counts of running, expired, desired, eligible, extraneous, and those in transitional states like starting, stopping, deleting).
- The reconciliation actions taken during each run (number of prebuilds created, deleted, or backoff events triggered).
These metrics should be emitted at every reconciliation run to track system behavior over time. This will enable monitoring of prebuild lifecycle health, early detection of anomalies (e.g., excessive expired prebuilds or frequent backoffs), and support alerting and analysis.
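As a starting point for discussion, here is a minimal sketch of how one run's state and actions could be turned into metric values. The types `PresetState`, `Actions`, `Sample`, and `ActionCounters` are hypothetical stand-ins, not the real `ReconciliationState` and `ReconciliationActions` structs, and the sketch uses only the standard library rather than a metrics client.

```go
package main

// PresetState is a hypothetical snapshot of one preset's prebuild counts.
// The fields mirror the states listed above, not the real struct.
type PresetState struct {
	Running, Expired, Desired, Eligible, Extraneous int
	Starting, Stopping, Deleting                    int
}

// Actions is a hypothetical summary of what one reconciliation run did.
type Actions struct {
	Created, Deleted, Backoffs int
}

// Sample is one gauge value labelled with a preset and a state.
type Sample struct {
	Preset, State string
	Value         float64
}

// StateSamples flattens a snapshot into one gauge sample per state,
// so a single metric with a "state" label can cover all of them.
func StateSamples(preset string, s PresetState) []Sample {
	counts := []struct {
		state string
		n     int
	}{
		{"running", s.Running},
		{"expired", s.Expired},
		{"desired", s.Desired},
		{"eligible", s.Eligible},
		{"extraneous", s.Extraneous},
		{"starting", s.Starting},
		{"stopping", s.Stopping},
		{"deleting", s.Deleting},
	}
	out := make([]Sample, 0, len(counts))
	for _, c := range counts {
		out = append(out, Sample{Preset: preset, State: c.state, Value: float64(c.n)})
	}
	return out
}

// ActionCounters accumulates action counts across runs, keyed by action
// type. Values only ever grow, which is what a counter metric expects.
type ActionCounters map[string]int

// Record adds the actions of one reconciliation run to the running totals.
func (c ActionCounters) Record(a Actions) {
	c["create"] += a.Created
	c["delete"] += a.Deleted
	c["backoff"] += a.Backoffs
}
```

In this sketch the reconciler would call `StateSamples` once per preset and `Record` once per run. State values are overwritten on each run (gauge semantics), while action totals accumulate (counter semantics).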
The objective is to review the code and determine an appropriate set of metrics that provide a comprehensive overview of the prebuilds reconciliation loop’s state and behavior.
Example metrics to consider (mainly focused on expired prebuilds)
- Expired prebuilds:
  - `reconciler_expired_prebuilds_total`: Number of prebuilds automatically invalidated and deleted due to a preset's configured `ttl`.
  - `reconciler_oldest_unclaimed_prebuild_age_seconds`: Age in seconds of the oldest unclaimed prebuild per preset.
  - `reconciler_presets_with_ttl_ratio`: Ratio of presets with `ttl` set to total presets.
- Reconciliation actions:
  - `reconciler_actions_total{action=create|delete|backoff}`: Counter of reconciliation actions taken, by type.
  - `reconciler_extraneous_prebuilds_total`: Number of running non-expired prebuilds above the desired count.
  - `reconciler_missing_prebuilds_total`: Number of prebuilds that should have been created but aren't yet.
  - `reconciler_backoff_total`: Number of times reconciliation was skipped due to backoff.