
on_revoke callback passed partition list with partition not already in assignment after consumer group session timeout #1551

@keithks

Description


We are using a single confluent_kafka.Consumer with a consumer group id that is always unique (we generate a fresh uuid4 for the group id).
We pass on_assign and on_revoke callbacks to the consumer.subscribe call.
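A minimal sketch of this setup (the `my-app-` prefix, bootstrap address, and config subset are hypothetical; the actual subscribe call is commented out because it needs a live broker):

```python
import uuid

def make_config(bootstrap_servers):
    # Fresh uuid4 per process, so every run joins its own, unique consumer group
    return {
        'bootstrap.servers': bootstrap_servers,
        'group.id': f'my-app-{uuid.uuid4()}',              # hypothetical prefix
        'partition.assignment.strategy': 'cooperative-sticky',
        'session.timeout.ms': 10000,
        'enable.partition.eof': True,
    }

# With confluent_kafka installed and a broker available, roughly:
# from confluent_kafka import Consumer
# consumer = Consumer(make_config('localhost:9092'))
# consumer.subscribe(topics, on_assign=_set_and_assign_offsets, on_revoke=_on_revoke)
```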

Our on_assign callback looks like this:

    # Imports needed by the snippets below (ConfluentConsumer is our alias
    # for confluent_kafka.Consumer):
    from typing import List
    from confluent_kafka import (Consumer as ConfluentConsumer, TopicPartition,
                                 OFFSET_BEGINNING, OFFSET_END, OFFSET_STORED)

    def _set_and_assign_offsets(self, consumer: ConfluentConsumer, partitions: List[TopicPartition]):
        self._set_offsets_by_script_state(consumer, partitions)

        self._event_loop.call_soon_threadsafe(self._assignment_sync_event.set)

    def _set_offsets_by_script_state(self, consumer: ConfluentConsumer, partitions: List[TopicPartition]):
        for p in partitions:
            if p.topic in self._begin_topics or p.topic in self._snapshot_topics:
                p.offset = OFFSET_BEGINNING if self._initialising else OFFSET_STORED
            elif p.topic in self._end_topics:
                p.offset = OFFSET_END if self._initialising else OFFSET_STORED

        consumer.incremental_assign(partitions)

Our on_revoke callback looks like this:

    def _on_revoke(self, consumer: ConfluentConsumer, partitions: List[TopicPartition]):
        _logger.info(f'These partitions have been revoked from the consumer {partitions}')
        consumer.incremental_unassign(partitions)

At some point we also use consumer.incremental_unassign to pause consumption for self._snapshot_topics.
We use batch consume (consumer.consume with num_messages > 1) to consume messages from this consumer.
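The batch-consume step can be sketched like this. `split_batch` is a hypothetical helper; `Consumer.consume(num_messages=..., timeout=...)` and `KafkaError._PARTITION_EOF` are the real confluent_kafka names (the latter is relevant because the reported config sets enable.partition.eof):

```python
def split_batch(msgs, is_eof):
    """Split a consumed batch into (records, eof_events); is_eof is a predicate."""
    records, eofs = [], []
    for m in msgs:
        (eofs if is_eof(m) else records).append(m)
    return records, eofs

# Against a live consumer this would be used roughly as:
# from confluent_kafka import KafkaError
# msgs = consumer.consume(num_messages=500, timeout=1.0)
# records, eofs = split_batch(
#     msgs, lambda m: m.error() is not None
#                     and m.error().code() == KafkaError._PARTITION_EOF)
```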

I think that when our broker is doing some daily cleanup we lose the group coordinator, and it seems the on_revoke callback is then passed topic partitions to which the consumer no longer has an assignment.

There is only one consumer in that consumer group, so we would expect all partitions of TOPIC_2 to be assigned to this consumer.

This also seems to happen without the on_revoke callback; initially we were happy with the defaults.

"2023-04-10T13:05:12.080Z","<HOSTNAME>","%4|1681131912.080|SESSTMOUT|<CONSUMER_NAME>| [thrd:main]: Consumer group session timed out (in join-state steady) after 10498 ms without a successful response from the group coordinator (broker 3, last error was Success): revoking assignment and rejoining group"
"2023-04-10T13:05:12.081Z","<HOSTNAME>","%4|1681131912.081|ASSIGN|<CONSUMER_NAME>| [thrd:main]: Group ""<GROUP_NAME>"": application *assign() call failed: <TOPIC_2> [0] can't be unassigned since it is not in the current assignment"
"2023-04-10T13:05:12.081Z","<HOSTNAME>","event='These partitions have been revoked from the consumer [TopicPartition{topic=<TOPIC_1>,partition=0,offset=-1001,error=None}, TopicPartition{topic=<TOPIC_1>,partition=1,offset=-1001,error=None}, TopicPartition{topic=<TOPIC_1>,partition=2,offset=-1001,error=None}, TopicPartition{topic=<TOPIC_1>,partition=3,offset=-1001,error=None}, TopicPartition{topic=<TOPIC_1>,partition=4,offset=-1001,error=None}, TopicPartition{topic=<TOPIC_2>,partition=0,offset=-1001,error=None}, TopicPartition{topic=<TOPIC_2>,partition=1,offset=-1001,error=None}, TopicPartition{topic=<TOPIC_2>,partition=2,offset=-1001,error=None}, TopicPartition{topic=<TOPIC_2>,partition=3,offset=-1001,error=None}, TopicPartition{topic=<TOPIC_2>,partition=4,offset=-1001,error=None}, TopicPartition{topic=<TOPIC_3>,partition=0,offset=-1001,error=None}, TopicPartition{topic=<TOPIC_3>,partition=1,offset=-1001,error=None}, TopicPartition{topic=<TOPIC_3>,partition=2,offset=-1001,error=None}, TopicPartition{topic=<TOPIC_3>,partition=3,offset=-1001,error=None}, TopicPartition{topic=<TOPIC_3>,partition=4,offset=-1001,error=None}]' logger='<package>.kafka.Consumer' level='info' timestamp='2023-04-10T13:05:12.080859Z'"
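One defensive sketch (a workaround idea, not a confirmed fix): filter the revoked list against consumer.assignment() so incremental_unassign is never asked to drop a partition the consumer no longer owns. The TP stand-in below exists only so the filtering logic can be exercised without a live consumer; in the real callback the elements would be confluent_kafka.TopicPartition objects.

```python
from collections import namedtuple

# Lightweight stand-in for confluent_kafka.TopicPartition (hypothetical,
# for illustration without a broker).
TP = namedtuple('TP', ['topic', 'partition'])

def still_assigned(revoked, current_assignment):
    """Keep only revoked partitions that are actually in the current assignment."""
    owned = {(p.topic, p.partition) for p in current_assignment}
    return [p for p in revoked if (p.topic, p.partition) in owned]

# In the callback this would look roughly like:
# def _on_revoke(self, consumer, partitions):
#     consumer.incremental_unassign(still_assigned(partitions, consumer.assignment()))
```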

How to reproduce

Using confluent_kafka 2.0.2 and librdkafka 2.0.2

Broker version is 2.6-IV0

Use the above setup and then shut down the group coordinator broker, which should guarantee a repro. We get this nightly, so it is pretty consistent.

Checklist

Please provide the following information:

  • confluent-kafka-python and librdkafka version (confluent_kafka.version() and confluent_kafka.libversion()): 2.0.2 / 2.0.2
  • Apache Kafka broker version: 2.6-IV0
  • Client configuration: {'auto.offset.reset': 'smallest', 'partition.assignment.strategy': 'cooperative-sticky', 'check.crcs': True, 'connections.max.idle.ms': 540000, 'enable.auto.commit': True, 'enable.partition.eof': True, 'fetch.wait.max.ms': 20, 'max.in.flight': 5, 'metadata.max.age.ms': 300000, 'reconnect.backoff.max.ms': 1000, 'reconnect.backoff.ms': 50, 'session.timeout.ms': 10000, 'socket.receive.buffer.bytes': 16777216, 'socket.send.buffer.bytes': 16777216, 'socket.timeout.ms': 30000}
  • Operating system: CentOS 8 on Docker
  • Provide client logs (with 'debug': '..' as necessary)
  • Provide broker log excerpts
  • Critical issue: No - working around by restarting the application


    Labels

    bug: Reporting an unexpected or problematic behavior of the codebase
    investigate further: It's unclear what the issue is at this time but there is enough interest to look into it
