
Race condition on rke2 server with calico cni new version - v3.29.2 #7808

Open
fmoral2 opened this issue Feb 21, 2025 · 0 comments
Labels
kind/bug Something isn't working


Environmental Info:
RKE2 Version:

RC v1.31.6-rc1+rke2r1 + Calico 3.29.2
RC v1.32.2-rc1+rke2r1 + Calico 3.29.2

Node(s) CPU architecture, OS, and Version:

Rocky Linux, SLES, and Ubuntu were used.

Cluster Configuration:

3 servers, 1 worker
CNI: Calico

Describe the bug:

When starting the rke2 server and joining new nodes to it, the new Calico version appears to hit a race condition: its pods are created faster than other chart resources they depend on.

Steps To Reproduce:

  • Install RKE2 with the cluster configuration above.
  • Join the new nodes all at once, as soon as the first server is ready.

Expected behavior:

  • All nodes start normally and become Ready.

Actual behavior:

  • The first node goes NotReady.

Additional context / logs:

  • The initial "helm-install-rke2-xxxx" pods fail to start because the Calico CNI plugin errors out with "error getting ClusterInformation": the clusterinformations.crd.projectcalico.org "default" resource does not exist yet, since it is created by the first calico-node pod. The impact is low because kube-api recreates the "helm-install-rke2-xxxx" pod (it is a Job) after a few minutes, by which point the resource exists and things work.
  • The calico-typha pod fails to start with "error=connection is unauthorized: Unauthorized": the "calico-typha" ClusterRole has not been created yet when the Typha pod starts. The impact is low because kube-api recreates the calico-typha pod (it is a Deployment) after some seconds, by which point the resource exists and things work.
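When reproducing, one way to confirm the ordering is to poll until the missing resource shows up before exercising the CNI. A minimal shell sketch; the `wait_for` helper is illustrative and not part of rke2 or Calico, and the commented `kubectl` example assumes a live cluster:

```shell
#!/bin/sh
# wait_for TIMEOUT CMD...: re-run CMD until it succeeds or TIMEOUT seconds pass.
wait_for() {
  timeout="$1"; shift
  start=$(date +%s)
  until "$@"; do
    now=$(date +%s)
    if [ $((now - start)) -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep 1
  done
}

# Example (hypothetical, against a live cluster): block until the default
# ClusterInformation exists, mirroring what the first calico-node creates.
# wait_for 120 kubectl get clusterinformations.crd.projectcalico.org default
```

This is essentially what kube-api's recreation of the Job/Deployment achieves implicitly, just made explicit for debugging.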

$ kubectl get felixconfigurations.crd.projectcalico.org default -o yaml

apiVersion: crd.projectcalico.org/v1
kind: FelixConfiguration
metadata:
 annotations:
  projectcalico.org/metadata: '{"generation":2,"creationTimestamp":"2025-02-17T17:43:45Z","labels":{"app.kubernetes.io/managed-by":"Helm"},"annotations":{"meta.helm.sh/release-name":"rke2-calico","meta.helm.sh/release-namespace":"kube-system"},"managedFields":[{"manager":"helm","operation":"Update","apiVersion":"crd.projectcalico.org/v1","time":"2025-02-17T17:43:45Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:meta.helm.sh/release-name":{},"f:meta.helm.sh/release-namespace":{}},"f:labels":{".":{},"f:app.kubernetes.io/managed-by":{}}},"f:spec":{".":{},"f:defaultEndpointToHostAction":{},"f:featureDetectOverride":{},"f:healthPort":{},"f:logSeveritySys":{},"f:wireguardEnabled":{},"f:xdpEnabled":{}}}},{"manager":"operator","operation":"Update","apiVersion":"crd.projectcalico.org/v1","time":"2025-02-17T17:43:52Z","fieldsType":"FieldsV1","fieldsV1":{"f:spec":{"f:nftablesMode":{},"f:vxlanVNI":{}}}}]}'
 creationTimestamp: "2025-02-17T17:43:45Z"
 generation: 3
 name: default
 resourceVersion: "1203"
 uid: 9243034e-2561-410b-8e29-f813a0f7d31b
spec:
 bpfConnectTimeLoadBalancing: TCP
 bpfHostNetworkedNATWithoutCTLB: Enabled
 bpfLogLevel: ""
 defaultEndpointToHostAction: Drop
 featureDetectOverride: ChecksumOffloadBroken=true
 floatingIPs: Disabled
 healthPort: 9099
 logSeverityScreen: Info
 logSeveritySys: Info
 nftablesMode: Disabled
 reportingInterval: 0s
 vxlanVNI: 4096
 wireguardEnabled: false
 xdpEnabled: true

$ kubectl get nodes   # one node is NotReady

NAME          STATUS     ROLES                       AGE   VERSION
ip .internal  Ready      control-plane,etcd,master   18h   v1.31.6+rke2r1
ip internal   Ready      <none>                      18h   v1.31.6+rke2r1
ip .internal  NotReady   control-plane,etcd,master   18h   v1.31.6+rke2r1
ip internal   Ready      control-plane,etcd,master   18h   v1.31.6+rke2r1

$ kubectl get pods -A   (from the NotReady node)

NAMESPACE      NAME                                                                  READY  STATUS             RESTARTS  AGE
calico-system  calico-node-w74p5                                                     1/1    Running            0         18h
calico-system  calico-typha-b67c75986-xd2xx                                          1/1    Terminating        0         18h
kube-system    cloud-controller-manager-ip-172-31-4-129.us-east-2.compute.internal   1/1    Running            0         18h
kube-system    etcd-ip- us-east-2.compute.internal                                   1/1    Running            0         18h
kube-system    kube-apiserver-ip .us-east-2.compute.internal                         1/1    Running            0         18h
kube-system    kube-controller-manager-ip- .us-east-2.compute.internal               1/1    Running            0         18h
kube-system    kube-proxy-ip-1 us-east-2.compute.internal                            1/1    Running            0         18h
kube-system    kube-scheduler- .us-east-2.compute.internal                           1/1    Running            0         18h
kube-system    rke2-coredns-rke2-coredns-55bdf87668-qx7p9                            0/1    Terminating        0         18h
kube-system    rke2-ingress-nginx-controller-rcl77                                   0/1    ContainerCreating  0         18h

$ kubectl logs calico-typha-89b7dd8dc-hm7rl -n calico-system

Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/profiles" error=connection is unauthorized: Unauthorized

kubelet log right at the moment the node goes NotReady:

  Warning  FailedCreatePodSandBox  33m                   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "2cd827447c8ff679686bbd9164d34b13aa58c17636b436b744e5144ec37000b7": plugin type="calico" failed (add): error getting ClusterInformation: resource does not exist: ClusterInformation(default) with error: clusterinformations.crd.projectcalico.org "default" not found

From the typha pod:

typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:24:47.322 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:24:47.322 [INFO][1] typha/rebalance.go 77: Calculated new connection limit. newLimit=10000 numNodes=4 numSyncerTypes=4 numTyphas=0 reason="error" thread="k8s-poll"
2025-02-19 14:24:47.322 [INFO][1] typha/sync_server.go 532: New target number of connections currentNum=8 newMax=10000 oldMax=400 thread="numConnsGov"
2025-02-19 14:25:19.914 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:25:19.915 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:25:51.666 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:25:51.666 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:26:23.401 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:26:23.401 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:26:53.458 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:26:53.458 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:27:24.690 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:27:24.690 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:27:56.671 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:27:56.671 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:28:29.643 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:28:29.643 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:29:01.661 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:29:01.661 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:29:33.090 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:29:33.090 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:30:03.906 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:30:03.906 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:30:36.082 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:30:36.082 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:31:08.967 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:31:08.967 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:31:39.937 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:31:39.937 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:32:12.674 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:32:12.674 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:32:45.218 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:32:45.218 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"

Calico project
projectcalico/calico@d6dbb99
