perf: don't call GetUserByID unnecessarily for Agents metrics loops #19395

Merged
merged 6 commits into from
Aug 21, 2025

Conversation

Contributor

@cstyan cstyan commented Aug 18, 2025

At the moment, the loop that retrieves and updates the agents metrics calls GetUserByID (a DB query) far more often than necessary. It first retrieves a list of all workspaces, filtering out inactive agents (it is not entirely clear to me whether this means non-running workspaces or just dead agents), and then iterates over those workspaces to get the rest of the data the metrics need. The next call in each iteration is GetUserByID for workspace.OwnerID.

This should at least partially resolve coder/internal#726 by caching seen User uuid.UUID in a map for each iteration of the loop.
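
A minimal sketch of the per-pass caching idea described above, not the code from this PR. The types `UserID`, `Workspace`, and `lookupFunc` are hypothetical stand-ins (a string replaces uuid.UUID so the example needs no third-party imports), and `lookup` plays the role of a DB query such as GetUserByID.

```go
package main

import "fmt"

// UserID stands in for uuid.UUID so the sketch needs no third-party imports.
type UserID string

// Workspace carries only the fields this sketch needs (hypothetical type).
type Workspace struct {
	Name    string
	OwnerID UserID
}

// lookupFunc is a hypothetical stand-in for a DB query such as GetUserByID.
type lookupFunc func(id UserID) (string, error)

// ownerUsernames resolves each workspace owner's username, calling lookup at
// most once per distinct owner ID within a single pass.
func ownerUsernames(workspaces []Workspace, lookup lookupFunc) (map[string]string, error) {
	seen := make(map[UserID]string)                 // owner ID -> username, rebuilt on every pass
	out := make(map[string]string, len(workspaces)) // workspace name -> username
	for _, ws := range workspaces {
		username, ok := seen[ws.OwnerID]
		if !ok {
			var err error
			username, err = lookup(ws.OwnerID)
			if err != nil {
				return nil, fmt.Errorf("look up owner %q: %w", ws.OwnerID, err)
			}
			seen[ws.OwnerID] = username
		}
		out[ws.Name] = username
	}
	return out, nil
}

func main() {
	calls := 0
	lookup := func(id UserID) (string, error) {
		calls++ // count how often the "database" is hit
		return "user-" + string(id), nil
	}
	workspaces := []Workspace{{"dev", "a"}, {"staging", "a"}, {"prod", "b"}}
	names, err := ownerUsernames(workspaces, lookup)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(names), "workspaces resolved with", calls, "lookups")
}
```

Because the map is rebuilt on every pass, a renamed user is picked up on the next metrics update; the cost is one lookup per distinct owner per pass rather than one per workspace.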

UPDATE: we now have a constraint for the username field in the users table which allows us to safely access the username field from the workspaces_expanded view. See #19453

UPDATE: It looks like, in theory, the calls here for GetUserByID should not even be necessary as we already have a database.Workspace object which also already has the owner ID and Username.

I left comments in both spots as to why the username should never be empty on the workspace again, but I'll reiterate here:
1. The owner_id field on the workspaces table is a FK reference to IDs in the users table and has a NOT NULL constraint, so the owner fields of a workspace will always be populated
2. While the users table technically only has a constraint that the username has to be NOT NULL (meaning empty string is valid), at user creation time our httpmw package enforces non-empty usernames (for example it calls codersdk.NameValid which enforces that the name is at least 1 character and fits the UsernameValidRegex)
3. The workspaces_expanded view has an inner join on workspaces.owner_id = visible_users.id, and if the owner is valid in the users table (which visible_users is a view of) then they will have a username set
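
To illustrate point 2, here is a rough sketch of an application-layer username check. `validateUsername` and `usernamePattern` are hypothetical names, and the regular expression is an illustrative pattern rather than one copied from codersdk; the point is only that a `NOT NULL` column still admits an empty string, so the non-empty guarantee has to come from a check like this (or from a database constraint).

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// usernamePattern is an illustrative pattern, NOT copied from codersdk:
// alphanumeric runs separated by single hyphens.
var usernamePattern = regexp.MustCompile(`^[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)*$`)

// validateUsername is a hypothetical stand-in for an application-layer check
// such as codersdk.NameValid.
func validateUsername(name string) error {
	if len(name) < 1 {
		return errors.New("username must be at least 1 character")
	}
	if !usernamePattern.MatchString(name) {
		return fmt.Errorf("username %q has an invalid format", name)
	}
	return nil
}

func main() {
	for _, name := range []string{"alice", "bob-smith", "", "-bad"} {
		fmt.Printf("%q valid: %t\n", name, validateUsername(name) == nil)
	}
}
```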

reduce calls to GetUserByID

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Member

@johnstcn johnstcn left a comment

Seeing as the only referenced field of user is Username, why not use workspace.OwnerUsername instead?

Contributor Author

cstyan commented Aug 18, 2025

> Seeing as the only referenced field of user is Username, why not use workspace.OwnerUsername instead?

Good point, I'll make that change so that we're only caching the username in the map 👍

cstyan added 3 commits August 18, 2025 17:33
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Comment on lines 337 to 348
if username == "" {
    logger.Warn(ctx, "in prometheusmetrics.Agents the username on workspace was empty string, this should not be possible",
        slog.F("workspace_id", workspace.ID),
        slog.F("template_id", workspace.TemplateID))
    // Fallback to GetUserByID if OwnerUsername is empty (e.g., in tests)
    user, err := db.GetUserByID(ctx, workspace.OwnerID)
    if err != nil {
        logger.Error(ctx, "can't get user from the database", slog.F("user_id", workspace.OwnerID), slog.Error(err))
        agentsGauge.WithLabelValues(VectorOperationAdd, 0, user.Username, workspace.Name, templateName, templateVersionName)
        continue
    }
    username = user.Username
Member

As far as I can tell, this situation would only be possible if we completely messed up the view. In that case, many other things would also be broken. Here, however, we would not only potentially do a whole bunch of user lookups, we would also spam a bunch of log errors.

IMO this should just error loudly so it's very obvious.

Contributor Author

> As far as I can tell, this situation would only be possible if we completely messed up the view.

Yes, it seems that way, though I'm not sure whether there's some other edge case where this could happen, in which case dropping the fallback path of querying GetUserByID would technically be a breaking change.

Removing the logging (warning level for the "workspace did not have a username set") feels reasonable. I would lean towards keeping the fallback DB query unless we have no requirements around breaking changes for metrics.

Member

> I would lean towards keeping the fallback DB query unless we have no requirements around breaking changes for metrics.

My point is, if we mess up the view in such a way that OwnerUsername is empty, it's very likely that more things would be broken than metrics. Having a non-empty username is a pretty fundamental assumption baked into the codebase right now.

Contributor Author

Right, but it could be possible that OwnerUsername is empty for only a very small subset of workspaces.

Is your suggestion that we error updating the metric as a whole (for all found workspace agents) if we see any empty username, or just for those workspaces which have an empty username (and do not call the GetUserByID query as a fallback)?

Member

I see two possible scenarios here:

  1. Some users end up actually having empty usernames (which is theoretically possible based on the database schema). In this case, there would be no benefit from calling GetUserByID.

  2. Some error updating the view results in an empty string being returned for owner_username. As far as I can tell, this would break things like application routing, SSH access, and also be distinctly visible in the UI. In this case, we have four options:
    a) Continue submitting metrics with an empty username field.
    b) Fall back to the GetUserByID query.
    c) Skip over the user/agent.
    d) Refuse to generate potentially invalid metrics at all, error and exit early.

a) could actually signal the issue more quickly, but at the cost of messed up metrics.
b) would likely correct the issue, but we would suddenly be performing more database queries unexpectedly.
c) would also surface the error in a way
d) might actually be overkill now that I think about it.
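
A rough sketch of what option (c) could look like, under stated assumptions: `Workspace`, `AgentLabels`, and `collectAgentLabels` are hypothetical names, and the real metrics loop carries more labels than shown. Workspaces with an empty owner username are skipped without a fallback query, and their names are returned so the caller can report them once instead of logging per workspace.

```go
package main

import "fmt"

// Workspace is a hypothetical, trimmed-down stand-in for database.Workspace.
type Workspace struct {
	Name          string
	OwnerUsername string
}

// AgentLabels is a hypothetical label set for one metric sample.
type AgentLabels struct {
	Username      string
	WorkspaceName string
}

// collectAgentLabels implements option (c): a workspace whose owner username
// is empty is skipped rather than triggering a fallback query, and the names
// of skipped workspaces are returned so the caller can report them once.
func collectAgentLabels(workspaces []Workspace) (labels []AgentLabels, skipped []string) {
	for _, ws := range workspaces {
		if ws.OwnerUsername == "" {
			skipped = append(skipped, ws.Name)
			continue
		}
		labels = append(labels, AgentLabels{Username: ws.OwnerUsername, WorkspaceName: ws.Name})
	}
	return labels, skipped
}

func main() {
	labels, skipped := collectAgentLabels([]Workspace{
		{Name: "dev", OwnerUsername: "alice"},
		{Name: "broken", OwnerUsername: ""},
		{Name: "prod", OwnerUsername: "bob"},
	})
	fmt.Println("emitted:", len(labels), "skipped:", skipped)
}
```

Workspaces with valid usernames still produce samples, and the number of database queries does not grow when the unexpected case occurs.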

Out of curiosity, I decided to see what would happen if a migration messed up the view and pushed #19426. The main takeaway I got from that is that a number of existing tests failed, but -- most notably -- coderd/prometheusmetrics and coderd/workspacestats didn't fail. If this is an area of concern, I'd suggest modifying the existing tests to guard for this so that we catch it.

Contributor Author

> Out of curiosity, I decided to see what would happen if a migration messed up the view and pushed #19426. The main takeaway I got from that is that a number of existing tests failed, but -- most notably -- coderd/prometheusmetrics and coderd/workspacestats didn't fail. If this is an area of concern, I'd suggest modifying the existing tests to guard for this so that we catch it.

IIUC that's because we would still have the GetUserByID call, so the view doesn't matter, we're taking the workspace owner ID and looking up that user.

I'm not sure I understand your point that B "would likely correct the issue, but we would suddenly be performing more database queries unexpectedly", since we're currently making a DB call for every active workspace.

Unless we want to introduce a potentially breaking change I think B is our only option. Otherwise we can remove the fallback query and go with option C, then workspaces with correct usernames set would still emit agent related metrics.

@cstyan cstyan changed the title perf: cache the seen user IDs in each iteration of the Agents metric loop perf: don't call GetUserByID unnecessarily for Agents metrics loops Aug 19, 2025
continue
username := workspace.OwnerUsername
// This should never be possible, but we'll guard against it just in case.
// 1. The owner_id field on the workspaces table is a reference to IDs in the users table and has a `NOT NULL` constraint
Contributor

Can we not just add a foreign key constraint on this field? Seems like an odd omission.
Along with that, a constraint to enforce the minimum length of the username would add a lot more integrity.

This would likely render the rest of this discussion moot as this situation would not be possible.
I do appreciate the defensiveness of this code though, thanks @cstyan 👍

Member

That's a good observation @dannykopping - I might suggest a separate PR though.

Contributor

Good idea 👍 that PR should block this one.

Contributor Author

Not sure I understand the FK constraint suggestion, what would that look like or do for us? The query here just uses a view workspaces_expanded, and the ownerID on the workspaces table is already a FK to users. So I think it's just the constraint to enforce a minimum username length that we need to add?

Contributor

Oh whoops, I didn't see the FK at the bottom of the file: https://github.com/coder/coder/blob/main/coderd/database/dump.sql#L3325-L3326

cstyan added a commit that referenced this pull request Aug 21, 2025
Username length and format, via regex, are already enforced at the
application layer, but we have some code paths with database queries
where we could optimize away many of the DB query calls if we could be
sure at the database level that the username is never an empty string.

For example: #19395

---------

Signed-off-by: Callum Styan <callumstyan@gmail.com>
constraint to users table

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
@cstyan cstyan merged commit 014a2d5 into main Aug 21, 2025
49 of 52 checks passed
@cstyan cstyan deleted the callum/reduce-agent-metric-db-calls branch August 21, 2025 18:01
@github-actions github-actions bot locked and limited conversation to collaborators Aug 21, 2025
Development

Successfully merging this pull request may close these issues.

bug: GetUserByID is called millions of times per day on dogfood
3 participants