Description
Observed behavior
We re-used the NATS JetStream Producer and PUSH Consumer examples from https://github.com/nats-io/nats.java/blob/main/src/examples/java/io/nats/examples/jetstream/NatsJsPushSubBasicAsync.java
and then killed the NATS leader node of the stream by forcefully killing the VM hosting the Kubernetes worker node.
Setup: in-memory storage, 3 replicas.
The createStream configuration was changed to:
StreamConfiguration sc = StreamConfiguration.builder()
    .name(streamName)
    .storageType(storageType)
    .subjects(subjects)
    .replicas(3)
    .description("LifeX-Test")
    .build();
and the connection builder to:
Options.Builder builder = new Options.Builder()
    .server(servers)
    .connectionTimeout(Duration.ofMillis(500))
    .pingInterval(Duration.ofSeconds(3))
    .maxPingsOut(2)
    .reconnectWait(Duration.ofMillis(500))
    .connectionListener(EXAMPLE_CONNECTION_LISTENER)
    .traceConnection()
    .errorListener(EXAMPLE_ERROR_LISTENER)
    .maxReconnects(-1)
    .reconnectDelayHandler(new PsReconnectDelayHandler())
    .reconnectJitter(Duration.ofMillis(500));
When connecting, we configure all 3 servers of the cluster and register connection, error and delay handlers (basically just logging the callbacks).
Setup environment: NATS cluster with 3 NATS pods on top of a Red Hat OpenShift Kubernetes cluster.
After the failure of the NATS leader node, the following happens:
- Producer detects the failure, reconnects and continues sending after ~15 seconds.
- Consumer DOES NOT detect the failure and is stuck. It logs no error or disconnection info, and the message handler receives no new messages. Note that if we restart the consumer, it can consume ALL messages sent by the producer, including the ones sent after the producer reconnected.
Note as well that the consumer aborts once the producer has finished and deleted the stream. At that point some disconnect logging is printed.
Logs of the NATS nodes are attached. Initially I could not add logs of the Java client because there were none; it seemed to just remain stuck indefinitely. UPDATE: Added Java client logs with the traceConnection setting enabled, and now we see more details.
The client seems to reconnect and resubscribe, but still does NOT get any further messages pushed.
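Because the stuck consumer produces no error or disconnect callback, one way to make the stall visible in a reproduction is a small idle watchdog around the message handler. The following is only a sketch, not part of the NATS client: `StallWatchdog`, its method names, and any idle threshold you pick are hypothetical. The message handler would call `onMessage()` for every delivery, and a periodic task would log when `isStalled()` returns true.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical helper (not a NATS API): flags a consumer that has not
// received any message for longer than a configured idle period.
final class StallWatchdog {
    private final long maxIdleNanos;
    private final LongSupplier nanoClock;
    private final AtomicLong lastSeenNanos;

    // nanoClock is injectable so the behaviour can be checked without sleeping.
    StallWatchdog(Duration maxIdle, LongSupplier nanoClock) {
        this.maxIdleNanos = maxIdle.toNanos();
        this.nanoClock = nanoClock;
        this.lastSeenNanos = new AtomicLong(nanoClock.getAsLong());
    }

    // Convenience constructor using the JVM's monotonic clock.
    StallWatchdog(Duration maxIdle) {
        this(maxIdle, System::nanoTime);
    }

    // Call this from the message handler on every delivered message.
    void onMessage() {
        lastSeenNanos.set(nanoClock.getAsLong());
    }

    // True when the time since the last message exceeds the idle period.
    boolean isStalled() {
        return nanoClock.getAsLong() - lastSeenNanos.get() > maxIdleNanos;
    }
}
```

Since the producer resumed after roughly 15 seconds in our test, an idle period comfortably above that (an assumption, to be tuned per setup) would avoid false alarms during the normal failover window.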
Expected behavior
Producer detects the failure, reconnects and continues sending.
Consumer detects the failure, reconnects and continues consuming.
Server and client version
Server: 2.9.22
Java Client: 2.16.14
Host environment
We used the official NATS container images and the HELM charts for deployment.
Steps to reproduce
Set up a 3-node cluster on Red Hat OpenShift or any other Kubernetes cluster
Start the producer with the settings above: https://github.com/nats-io/nats.java/blob/main/src/examples/java/io/nats/examples/jetstream/NatsJsPub.java
Start the consumer with the settings above: https://github.com/nats-io/nats.java/blob/main/src/examples/java/io/nats/examples/jetstream/NatsJsPushSubBasicAsync.java
Kill the leader node (find the stream leader using the nats CLI)
Logs:
nats-server-2-logs.txt
nats-server-1-logs.txt
logs.zip