NATS Push Consumer Client Stuck after NATS Node Failure #997

Closed
@stefanLeo

Description

Observed behavior

We reused the NATS JetStream producer and push consumer examples from https://github.com/nats-io/nats.java/blob/main/src/examples/java/io/nats/examples/jetstream/NatsJsPushSubBasicAsync.java
and then killed the NATS leader node of the stream (we forcefully killed the VM hosting the Kubernetes worker node).

Setup: in-memory storage, 3 replicas.

The stream configuration in createStream was changed to:

    StreamConfiguration sc = StreamConfiguration.builder()
            .name(streamName)
            .storageType(storageType)
            .subjects(subjects)
            .replicas(3)
            .description("LifeX-Test")
            .build();

and the connection builder to:

    Options.Builder builder = new Options.Builder()
            .server(servers)
            .connectionTimeout(Duration.ofMillis(500))
            .pingInterval(Duration.ofSeconds(3))
            .maxPingsOut(2)
            .reconnectWait(Duration.ofMillis(500))
            .connectionListener(EXAMPLE_CONNECTION_LISTENER)
            .traceConnection()
            .errorListener(EXAMPLE_ERROR_LISTENER)
            .maxReconnects(-1)
            .reconnectDelayHandler(new PsReconnectDelayHandler())
            .reconnectJitter(Duration.ofMillis(500));

When connecting, we configure all 3 servers of the cluster and register connection, error, and reconnect-delay handlers (which basically just log the callbacks).
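The PsReconnectDelayHandler class itself is not shown in this report. Purely as an illustration of what a logging-only delay handler could look like, here is a minimal sketch. The class name LoggingReconnectDelayHandler, the fixed delay, and the in-memory log list are all hypothetical. The method mirrors the getWaitTime(long totalTries) callback of the Java client's ReconnectDelayHandler interface, but the class deliberately has no dependency on the client library so that it runs standalone.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a logging-only reconnect delay handler.
// Assumption: to plug it into the client, the class would additionally
// declare "implements io.nats.client.ReconnectDelayHandler".
final class LoggingReconnectDelayHandler {
    private final Duration fixedDelay;
    private final List<String> log = new ArrayList<>();

    LoggingReconnectDelayHandler(Duration fixedDelay) {
        this.fixedDelay = fixedDelay;
    }

    // Receives the number of reconnect attempts made so far, logs it,
    // and returns how long to wait before the next attempt.
    public Duration getWaitTime(long totalTries) {
        String line = "reconnect attempt " + totalTries
                + ", waiting " + fixedDelay.toMillis() + " ms";
        log.add(line);
        System.out.println(line);
        return fixedDelay;
    }

    // Exposes the recorded lines so the behavior can be checked in tests.
    List<String> getLog() {
        return log;
    }
}
```

A handler like this only makes reconnect attempts visible. It does not change when or whether the client notices a dead connection.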

Setup environment: NATS cluster with 3 NATS pods on top of a Red Hat OpenShift Kubernetes cluster.

After the failure of the NATS leader node, the following happens:

  1. The producer detects the failure, reconnects, and continues sending after ~15 seconds.
  2. The consumer DOES NOT detect the failure and is stuck. It logs no error or disconnection info, and the message handler receives no new messages. Note that if we restart the consumer, it can consume ALL messages sent by the producer, including the ones sent after the producer reconnected.
    Note as well that the consumer aborts once the producer is done and deletes the stream. At that point a disconnect log is printed.

Logs of the NATS nodes are attached. Initially I could not add logs of the Java client because there were none; it seemed to just remain stuck indefinitely. UPDATE: Added Java client logs with the traceConnection setting, and now we see more details.
The client seems to reconnect and resubscribe, but still does NOT get any further messages pushed.
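Because the stalled consumer produces no error or disconnect callback, one way to at least surface the symptom from the application side is an idle watchdog around the message handler. The following is a minimal sketch, not a fix for the underlying problem and not part of the NATS client. The class StallWatchdog, its method names, and the 30-second threshold in the usage note are all hypothetical.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical application-level watchdog: flags a consumer as stalled
// when no message has arrived for longer than maxIdle.
final class StallWatchdog {
    private final long maxIdleNanos;
    private final LongSupplier nanoClock;      // injectable for testing
    private final AtomicLong lastMessageNanos; // updated from the handler thread

    StallWatchdog(Duration maxIdle, LongSupplier nanoClock) {
        this.maxIdleNanos = maxIdle.toNanos();
        this.nanoClock = nanoClock;
        this.lastMessageNanos = new AtomicLong(nanoClock.getAsLong());
    }

    // Call from the message handler every time a message is received.
    void onMessage() {
        lastMessageNanos.set(nanoClock.getAsLong());
    }

    // Call periodically (for example from a scheduled task).
    // Returns true when the idle time exceeds the configured maximum.
    boolean isStalled() {
        return nanoClock.getAsLong() - lastMessageNanos.get() > maxIdleNanos;
    }
}
```

In an application this might be created as new StallWatchdog(Duration.ofSeconds(30), System::nanoTime), with onMessage() called inside the push consumer's handler. A true result from isStalled() is only a hint: a stream with no new messages looks the same as a stuck subscription, so the check is most useful while the producer is known to be publishing.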

Expected behavior

Producer detects the failure, reconnects and continues sending.
Consumer detects the failure, reconnects and continues consuming.

Server and client version

Server: 2.9.22
Java Client: 2.16.14

Host environment

We used the official NATS container images and the Helm charts for deployment.

Steps to reproduce

  1. Set up a 3-node cluster on Red Hat OpenShift or any other Kubernetes cluster.
  2. Start the producer with the settings above: https://github.com/nats-io/nats.java/blob/main/src/examples/java/io/nats/examples/jetstream/NatsJsPub.java
  3. Start the consumer with the settings above: https://github.com/nats-io/nats.java/blob/main/src/examples/java/io/nats/examples/jetstream/NatsJsPushSubBasicAsync.java
  4. Kill the leader node (find the stream leader using the nats CLI).

Logs:
nats-server-2-logs.txt
nats-server-1-logs.txt
logs.zip

Labels: defect (suspected defect such as a bug or regression)
