You probably scaled the Kafka cluster down in the past with an older Strimzi version, and Kafka still has this node registered but invisible because of missing APIs. This is not a Strimzi bug but a Kafka KRaft limitation; it should be addressed only in Kafka 4.0. You have to work around it manually by unregistering the node using the Kafka Admin API.
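For readers who hit this, a minimal sketch of the manual unregistration, assuming the Kafka distribution in the broker image ships the `unregister` subcommand of `kafka-cluster.sh` (whether it does is discussed below) and that the cluster is named `my-cluster` in namespace `kafka`:

```sh
# Run the tool from inside a broker pod, against the client bootstrap
# service. The node ID (3 here) is the one reported in the error message.
kubectl -n kafka exec -it my-cluster-kafka-0 -- \
  bin/kafka-cluster.sh unregister \
    --bootstrap-server my-cluster-kafka-bootstrap:9092 \
    --id 3
```

Under the hood this performs the same `unregisterBroker` operation that the Kafka Admin API exposes.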
-
@scholzj Thanks, that could be it. It seems the command line tools to list and unregister nodes aren't available in 3.9.0; is this correct?
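One way to check what a given version ships (a sketch, run inside a broker pod; pod name assumed):

```sh
# Print the tool's usage; look for an "unregister" subcommand in the output.
kubectl -n kafka exec -it my-cluster-kafka-0 -- bin/kafka-cluster.sh --help
```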
-
There was no command line tool for it. But I'm not sure I ever checked in Kafka 3.9; maybe someone added it in that version. You could also try to scale up (add the node reported in the error message) and scale it down again. New Strimzi versions try to work around this Kafka limitation and should unregister the node. But if it was a controller, the scaling is tricky, as that is another unsupported thing :-/. You can also try to add it to the `.status.registeredNodeIds` list in the Kafka CR with `kubectl edit kafka my-cluster --subresource=status` to trigger the unregistration.
-
Triaged on 23.1.2025: it seems the
-
Not sure I did the right thing, but it complains that the "given broker ID was not registered" (tried with id >= 3).
-
How did you work out which broker ID it was when you scaled down?
-
0, 1, 2 are the current broker/controller combos, and at some point there were 0, 1, 2 as brokers and 3, 4, 5 as controllers only, IIRC. At least I assumed those were the IDs needed for `kafka-cluster.sh unregister`.
-
So you scaled down controllers, which is something not really supported by KRaft right now. The quorum is static; dynamic quorum (with support for scaling controllers down) will come with Kafka 4.x. So I guess this is the reason why the unregister doesn't work: it's for brokers.
Also, can you describe the steps you took to go from brokers 0, 1, 2 (in one node pool) and controllers 3, 4, 5 (in another node pool) to combined brokers/controllers 0, 1, 2? I could try to replicate what you had.
-
@ppatierno If you mean "You can also try to add it to the `.status.registeredNodeIds` list in the Kafka CR with `kubectl edit kafka my-cluster --subresource=status` to trigger the unregistration": I did try that, by adding 3, 4, 5 to `.status.registeredNodeIds`, which contained 0, 1, 2. It had no effect and only 0, 1, 2 remained. It was a little while ago, but I recall doing this:
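For reference, the edit in question looks roughly like this (a sketch; cluster name `my-cluster` assumed, and the behaviour described is the suggestion above, not something guaranteed by every operator version):

```sh
# Edit the status subresource of the Kafka CR directly; per the suggestion
# above, the operator should try to unregister any IDs listed in
# .status.registeredNodeIds that no longer exist in the cluster.
kubectl edit kafka my-cluster --subresource=status
# ...then add the stale IDs (e.g. 3, 4, 5) to .status.registeredNodeIds.
```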
-
Hey. We have the same issue, with Strimzi operator 0.44.
Creating a 2nd cluster and migrating the data there with KafkaMirrorMaker, then switching the Kubernetes services to the new location, looks more straightforward. WDYT?
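If anyone goes down that route, a minimal sketch of the mirroring piece, assuming `KafkaMirrorMaker2` and clusters named `old-cluster` and `new-cluster` (all names and settings here are placeholders, not a tested migration plan):

```sh
# Mirror all topics and consumer groups from the old cluster to the new one;
# once caught up, repoint clients (e.g. by switching Kubernetes Services).
kubectl apply -f - <<'EOF'
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: migration
spec:
  version: 3.9.0
  replicas: 1
  connectCluster: new-cluster
  clusters:
    - alias: old-cluster
      bootstrapServers: old-cluster-kafka-bootstrap:9092
    - alias: new-cluster
      bootstrapServers: new-cluster-kafka-bootstrap:9092
  mirrors:
    - sourceCluster: old-cluster
      targetCluster: new-cluster
      sourceConnector:
        config:
          replication.factor: -1   # -1 inherits the broker default
      checkpointConnector:
        config:
          checkpoints.topic.replication.factor: -1
      topicsPattern: ".*"
      groupsPattern: ".*"
EOF
```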
-
Not an option of what? You should probably share the full command and the full output.
Not sure if this is related, as this is really Kafka business. But you should probably run it against your own listener and not against 9090, especially with older Kafka versions where the 9090 port is mostly unresponsive to most commands.
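For context, in Strimzi 9090 is the internal control-plane port (9091 is replication); the client-facing listeners are typically 9092 (plain) and 9093 (TLS). A quick way to find the right bootstrap address (a sketch; the label value assumes a cluster named `my-cluster`):

```sh
# List the cluster's services; use the *-kafka-bootstrap service with the
# 9092/9093 listener port, not the internal 9090/9091 ports.
kubectl get svc -l strimzi.io/cluster=my-cluster
```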
-
The Strimzi controllers have only three ports. Am I correct that we must connect to the brokers instead? `--subresource=status` is not an option because the operator doesn't provide meaningful information regarding the lost broker node. The cluster with
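As a side note, the controller pods' ports can be inspected directly (a sketch; the pod name assumes a node pool called `controller`):

```sh
# Show the container ports on a controller pod; expect the Strimzi-internal
# control-plane (9090) and replication (9091) listeners among them, which
# do not serve normal admin-client traffic.
kubectl get pod my-cluster-controller-0 \
  -o jsonpath='{.spec.containers[0].ports}'
```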
-
A note for other readers: we have
It is tough to make mistakes in
-
Regarding the suggestion to connect to brokers instead of controllers: 🤷
We may need to wait for Kafka 4, but mirroring the data to a new cluster is the simplest way for now, as far as I can see. P.S. I tried to unregister ID 6, which we don't have in our cluster, just to be sure. The outcome is the same.
-
We fixed the issue. 🎉 We scaled the controller pool (not the brokers) to 7 pods and then scaled it back to 3. If you see that the controllers go crazy and cannot elect a leader within 10-15 minutes, kill/restart them.
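A sketch of that scale-up/scale-down, assuming a `KafkaNodePool` named `controller` (adjust names and replica counts to your setup):

```sh
# Scale the controller node pool up...
kubectl patch kafkanodepool controller --type merge -p '{"spec":{"replicas":7}}'
# ...wait until the new pods are ready and the quorum is stable,
# then scale back down to the original count.
kubectl patch kafkanodepool controller --type merge -p '{"spec":{"replicas":3}}'
```

Note the caveat earlier in the thread: scaling KRaft controllers is not officially supported before dynamic quorum arrives in Kafka 4.x.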
-
Bug Description
I have upgraded the operator from 0.44.0 to 0.45.0 and then edited the Kafka CR to change spec.kafka.version from 3.8.0 to 3.9.0. The pods were recreated with the new image, but the upgrade did not complete. The cluster now has this status:
I tried a manual upgrade:
Steps to reproduce
No response
Expected behavior
No response
Strimzi version
0.45.0
Kubernetes version
1.27.11
Installation method
Helm
Infrastructure
Bare-metal
Configuration files and logs
No response
Additional context
No response