
Recommended way to read partial historical data? #1509

@r-owen

Description

I am trying to obtain a few older messages (not all of them) for each of several topics (with just one partition per topic and one consumer per group). I have found a complicated way to do it, but am asking if there is something simpler.

For the sake of argument, say I want 10 older messages if they are available, else fewer. What works is to specify an on_assign callback for Consumer.subscribe, then, in the callback function:

  • If NOT called before (I only want historical data the first time), for each topic/partition:
    • Call Consumer.get_watermark_offsets to get the current high watermark. (I tried Consumer.positions, but it always returns -1001, i.e. OFFSET_INVALID.)
    • Set partition.offset to the desired new position (e.g. high watermark - 10).
  • Call Consumer.assign(partitions)

This seems overly complicated, especially as I only want to move the offsets back when I first subscribe to the topics. Is there a simpler or better way? Even a way to make the callback be called only once would be nice.
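
For reference, here is the core of that approach in miniature. This is a sketch, not verbatim from my program: BROKER_ADDR, get_random_string, topic_names, and MAX_HISTORY_DEPTH are as in the full example under "How to reproduce" below.

from confluent_kafka import Consumer

consumer = Consumer(
    {"bootstrap.servers": BROKER_ADDR, "group.id": get_random_string()}
)

def on_assign(consumer, partitions):
    for partition in partitions:
        lo, hi = consumer.get_watermark_offsets(partition, cached=False)
        if hi > 0:
            # Rewind to at most MAX_HISTORY_DEPTH messages
            # before the high watermark.
            partition.offset = max(lo, hi - MAX_HISTORY_DEPTH)
    consumer.assign(partitions)

consumer.subscribe(topic_names, on_assign=on_assign)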

Things I have tried:

  • Call Consumer.positions and Consumer.seek after subscribing. This always fails, perhaps because no partitions have been assigned yet (sketched below).
  • Call poll first, then Consumer.positions and Consumer.seek. This doesn't seem to help, and I really don't want any data showing up until I have adjusted the offsets.
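
For concreteness, here is a sketch of the first attempt (consumer, topic_names, and MAX_HISTORY_DEPTH as in the full example below). The seek call raises an exception, presumably because no partitions have been assigned yet:

from confluent_kafka import TopicPartition

consumer.subscribe(topic_names)
for topic_name in topic_names:
    partition = TopicPartition(topic_name, 0)
    # positions() returns offset -1001 (OFFSET_INVALID) here,
    # and seek() raises because the partition is not yet assigned.
    (position,) = consumer.positions([partition])
    partition.offset = position.offset - MAX_HISTORY_DEPTH
    consumer.seek(partition)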

How to reproduce

It's a question, not a bug, but here is the simplest example I could come up with. I'm hoping for something simpler than the on_assign_callback method.

import base64
import os

from confluent_kafka import OFFSET_BEGINNING, Consumer, Producer
from confluent_kafka.admin import AdminClient, NewTopic

BROKER_ADDR = "broker:29092"

# The maximum number of historical messages we want
MAX_HISTORY_DEPTH = 10


def get_random_string() -> str:
    """Get a random string."""
    return base64.urlsafe_b64encode(os.urandom(12)).decode().replace("=", "_")


class HistoryExample:
    def __init__(self):
        base_name = get_random_string()
        self.topic_names = [f"historical.{base_name}.topic{i}" for i in range(3)]
        self.on_assign_called = False

    def setup(self):
        # Create topics
        broker_client = AdminClient({"bootstrap.servers": BROKER_ADDR})
        for topic_name in self.topic_names:
            new_topic = NewTopic(
                topic=topic_name, num_partitions=1, replication_factor=1
            )
            create_result = broker_client.create_topics([new_topic])
            for _, future in create_result.items():
                assert future.exception() is None
            print(f"created topic {topic_name}")

        # Create producer
        self.producer = Producer(
            {
                "acks": 1,
                "queue.buffering.max.ms": 0,
                "bootstrap.servers": BROKER_ADDR,
            }
        )

        # Create lots of historical data for topic 0
        # a few items for topic 1 and none for topic 2
        for topic_name, num_hist_messages in zip(self.topic_names, (20, 3, 0)):
            for i in range(num_hist_messages):
                raw_data = f"{topic_name} message {i}".encode()
                self.producer.produce(topic_name, raw_data)
                self.producer.flush()

        # Create the consumer
        self.consumer = Consumer(
            {
                "bootstrap.servers": BROKER_ADDR,
                # Make sure every consumer is in its own consumer group,
                # since each consumer acts independently.
                "group.id": get_random_string(),
                # Require explicit topic creation, so we can control
                # topic configuration, and to reduce startup latency.
                "allow.auto.create.topics": False,
                # Protect against a race condition in the on_assign callback:
                # if the broker purges data while the on_assign callback
                # is assigning the desired historical data offset,
                # data might no longer exist at the assigned offset;
                # in that case read from the earliest data.
                "auto.offset.reset": "earliest",
            }
        )
        self.consumer.subscribe(self.topic_names, on_assign=self.on_assign_callback)

    def on_assign_callback(self, consumer, partitions):
        """Set partition offsets to read historical data.

        Intended as the Consumer.subscribe on_assign callback function.

        Parameters
        ----------
        consumer
            Kafka consumer.
        partitions
            List of TopicPartitions assigned to consumer.

        Notes
        -----
        Read the current low and high watermarks for each partition.
        On the first call, for partitions that have existing data,
        adjust the offset to read up to MAX_HISTORY_DEPTH
        historical messages, starting before the high watermark.

        Note: the on_assign callback must call consumer.assign
        with all of the partitions passed in, and it also must set
        the ``offset`` attribute of each of those partitions,
        regardless of whether we want historical data for that topic.
        """
        for partition in partitions:
            min_offset, max_offset = consumer.get_watermark_offsets(
                partition, cached=False
            )
            print(f"{partition.topic} {min_offset=}, {max_offset=}")
            if max_offset <= 0:
                # No data yet written, and we want all new data.
                # Use OFFSET_BEGINNING in case new data arrives
                # while we are assigning.
                partition.offset = OFFSET_BEGINNING
                continue

            if self.on_assign_called:
                # No more historical data wanted; start from now
                partition.offset = max_offset
            else:
                desired_offset = max_offset - MAX_HISTORY_DEPTH
                if desired_offset <= min_offset:
                    desired_offset = OFFSET_BEGINNING
                partition.offset = desired_offset

        # Record that the first assignment has been handled,
        # so later rebalances do not re-read historical data.
        self.on_assign_called = True
        consumer.assign(partitions)

    def consume(self):
        while True:
            message = self.consumer.poll(0.1)
            if message is None:
                continue
            message_error = message.error()
            if message_error is not None:
                print(f"Ignoring Kafka message with error {message_error!r}")
                continue
            print(f"Read {message.topic()}: {message.value()}")

example = HistoryExample()
example.setup()
example.consume()
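
For reference, when run against a local broker this should print the last 10 messages of topic 0, all 3 messages of topic 1, and nothing from topic 2 until new data arrives.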

Checklist

Please provide the following information:

  • [x] confluent-kafka-python and librdkafka version (confluent_kafka.version() and confluent_kafka.libversion()):
    confluent_kafka.version(): ('1.9.2', 17367552)
    confluent_kafka.libversion(): ('1.9.2', 17367807)
  • [x] Apache Kafka broker version:
    confluentinc/cp-enterprise-kafka 7.3.0
    Running with docker-compose.
  • [x] Client configuration: {...}
    See "How to reproduce", above.
  • [x] Operating system:
    macOS 13.1, with Kafka running in a Docker image
  • Provide client logs (with 'debug': '..' as necessary)
  • Provide broker log excerpts
  • Critical issue

