Reconcile QQ node dead during delete and redeclare #14241

Open · wants to merge 4 commits into base: main
Conversation

@LoisSotoLopez (Contributor) commented Jul 16, 2025

Proposed Changes

This PR implements the suggested solution for the issue described in discussion #13131

Currently, when a QQ is deleted and re-declared while one of its nodes is dead, the dead node won't be able to reconcile with the new queue.

In this PR we add the list of Ra UIDs for the cluster to the queue record on each node, so that when a RabbitMQ node recovers a queue it can detect the situation described above and reconcile properly.
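The detection idea can be sketched as a pure function (the module and function names below are hypothetical, for illustration only; the actual change lives in RabbitMQ's quorum queue recovery path). A node compares the Ra UID it knows locally with the one recorded for it in the queue's per-node UID map; a mismatch means the local member belongs to a previous incarnation of the queue:

```erlang
%% Illustrative sketch only: names are hypothetical, not the PR's actual API.
%% Each queue record is assumed to carry a #{node() => RaUId} map. On
%% recovery, a node compares its locally known Ra UID against the UID
%% recorded for it in the (possibly re-declared) queue.
-module(qq_reconcile_sketch).
-export([needs_reconcile/3]).

-spec needs_reconcile(node(), binary(), #{node() => binary()}) -> boolean().
needs_reconcile(Node, LocalUId, ClusterUIds) ->
    case maps:find(Node, ClusterUIds) of
        {ok, LocalUId} -> false;  %% UIDs match: same queue incarnation
        {ok, _Other}   -> true;   %% queue was re-declared while we were down
        error          -> true    %% this node is no longer a member
    end.
```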

Types of Changes

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

Co-authored-by: Péter Gömöri <gomoripeti@users.noreply.github.com>
@kjnilsson (Contributor) commented:

thanks @LoisSotoLopez - this looks like it will do what we discussed a while back. I feel unsure about adding another key to the queue type state for this, mainly because we'd have to keep the UIDs and nodes in sync. It would be nicer if nodes turned from a list into a map, although even that is a bit controversial and could become a source of bugs. Let me consider it for a day or two.

@kjnilsson (Contributor) left a comment:

I don't like having two keys with similar information (nodes) that will need to be kept in sync. I think we need to move the current nodes list value to the #{node() => uid()} map format and handle the two formats in the relevant places, mostly in rabbit_queue_type. We do need a list of the member nodes in a few places, but we could add a convenience function, rabbit_queue_type:nodes/1, that takes a queue record and returns a list of member nodes. Internally it could just call rabbit_queue_type:info(Q, [members]) and extract the result from that, then update all places where we explicitly use get_type_state to extract the member nodes.

In addition, I think we need to put the use of nodes as a map behind a feature flag, to avoid new queue records with nodes map values being created in a mixed-versions cluster.
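The dual-format handling suggested above could look something like this sketch (a hypothetical standalone helper; the real change would live inside rabbit_queue_type / amqqueue):

```erlang
%% Sketch of handling both nodes formats behind one accessor.
%% Old format: nodes is a plain list of node names.
%% New format: nodes is a #{node() => RaUId} map.
-module(qq_nodes_sketch).
-export([member_nodes/1]).

-spec member_nodes([node()] | #{node() => binary()}) -> [node()].
member_nodes(Nodes) when is_list(Nodes) -> Nodes;
member_nodes(Nodes) when is_map(Nodes)  -> maps:keys(Nodes).
```

Callers only ever see a list of member nodes, so the map format (and any feature-flag gating of it) stays an internal detail of the queue type state.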

We are moving the functionality of getting the nodes/members of an
amqqueue from the `amqqueue` module to `rabbit_amqqueue`. This follows
previous PRs' work towards reducing direct access to the
`QueueTypeState`, such as
rabbitmq#13905. Also, we will
need to distinguish between the different formats of the `nodes` entry
in the `QueueTypeState`, to support both the previous one (a list of
nodes) and the new one (a map of nodes to Ra UIDs). Doing so in a module
such as `amqqueue`, which is meant to be an accessor module around the
`amqqueue` record, doesn't feel right.
@LoisSotoLopez (Contributor, Author) commented:
@kjnilsson Thanks for the suggestions. Just wanted to let you know we are working on this. We had an incident to take care of this week, but I'll be pushing this PR forward next week.

@michaelklishin (Collaborator) commented:
@LoisSotoLopez we have to ask your employer to sign the Broadcom CLA before we can accept this contribution (or its future finished version).

It is about one page long, nothing particularly unusual or onerous.

@LoisSotoLopez (Contributor, Author) commented Aug 1, 2025

The commit below is just to show current progress. I have been struggling to understand why a few of the remaining tests fail.

@LoisSotoLopez LoisSotoLopez force-pushed the qq_uuid_in_metadata_store branch from 1074165 to 35ef780 Compare August 1, 2025 09:50
@@ -791,6 +810,23 @@ recover(_Vhost, Queues) ->
ServerId = {Name, node()},
QName = amqqueue:get_name(Q0),
MutConf = make_mutable_config(Q0),
RaUId = ra_directory:uid_of(?RA_SYSTEM, Name),
@LoisSotoLopez (Contributor, Author) commented on the diff, Aug 5, 2025:

Figured out what I was doing wrong, but I'm not sure what the best approach to fix it is. Let me recap and raise some questions.

So when a QQ is declared, I am setting the #{node() => ra_uid()} map in the queue type state. I'll refer to the values in that map as NodeRaUids, as opposed to the UID associated with each QQ as a Ra cluster (the one generated by calling rabbit_quorum_queue:make_ra_conf), which I'll refer to as the ClusterRaUid.

When a queue is deleted and re-declared, the queue type state gets re-generated (no doubt about that, because it's a new queue). Therefore the queue type state will change between queue reincarnations. However, the ClusterRaUid will not change for those nodes that were dead during the delete+redeclare.

I was checking, on member recovery, whether the ClusterRaUid, as retrieved with ra_directory:uid_of, was associated with the current node in the map of NodeRaUids. That's not right.

My current approach is storing the NodeRaUids map on disk, as is done for the ClusterRaUid, and on recovery comparing that on-disk map with the one in the queue type state.

My questions for you are: does that last paragraph sound right? And if it does, what API should I use to store that on-disk copy of the NodeRaUids map? Is there any ra_ module that I could use?
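The recovery check described in that paragraph amounts to a simple map comparison; a minimal sketch, assuming a hypothetical helper and that the on-disk copy may be absent on first recovery after an upgrade:

```erlang
%% Hypothetical sketch of the recovery check: compare the NodeRaUids map
%% persisted on disk with the one in the current queue type state. Equal
%% maps mean the same queue incarnation; a difference (or a missing node
%% entry) means the queue was deleted and re-declared while this node was
%% down, so the local member must be reconciled.
-module(qq_recover_sketch).
-export([same_incarnation/2]).

-spec same_incarnation(#{node() => binary()} | undefined,
                       #{node() => binary()}) -> boolean().
same_incarnation(undefined, _Current) ->
    %% nothing persisted yet (e.g. first recovery after upgrade): assume ok
    true;
same_incarnation(OnDisk, Current) ->
    OnDisk =:= Current.
```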

A Collaborator commented:
If we already store the ClusterRaUid that way, I guess we can persist more metadata that the Ra members won't otherwise have/preserve.

@LoisSotoLopez (Contributor, Author) commented Aug 6, 2025:
The ClusterRaUid gets persisted through ra_directory.erl, which seems designed strictly for storing Ra clusters' specific pieces of related information. Maybe we just need a qq_nodes_uids dets table for the sole purpose of storing those NodeRaUids, or we could use Rabbit's metadata store.

Edited: forget about this, I found the error that led to this confusion.

@michaelklishin (Collaborator) commented:

@LoisSotoLopez sorry, any updates or feedback on the new CLA? We cannot accept any contributions without a signed CLA at the moment, and this change won't qualify for a "trivial" one, even if those get exempt in the future.

We are trying to make it digital one way or another but there's a risk that the process will stay what it currently is (just a document to sign and email).

@LoisSotoLopez (Contributor, Author) commented Aug 6, 2025:

> @LoisSotoLopez sorry, any updates or feedback on the new CLA? We cannot accept any contributions without a signed CLA at the moment, and this change won't qualify for a "trivial" one, even if those get exempt in the future.
>
> We are trying to make it digital one way or another but there's a risk that the process will stay what it currently is (just a document to sign and email).

Yes, sorry about not having this addressed already. We are on summer vacation right now, so the people who will be signing it for the whole company won't be able to do it until next week. I would do it myself but can't, due to an intellectual property clause in my contract that wasn't there the last time I signed the CLA.

Will try to get it signed asap.

3 participants