
Race condition on rke2 server with calico cni new version - v3.29.2 #7808

Open
fmoral2 opened this issue Feb 21, 2025 · 0 comments
Labels
kind/bug Something isn't working


Environmental Info:
RKE2 Version:

RC v1.31.6-rc1+rke2r1 + Calico 3.29.2
RC v1.32.2-rc1+rke2r1 + Calico 3.29.2

Node(s) CPU architecture, OS, and Version:

Rocky Linux, SLES, and Ubuntu were used.

Cluster Configuration:

3 servers, 1 worker
CNI: Calico

Describe the bug:

When starting the rke2 server and joining new nodes to it, the new Calico version appears to hit a race condition: its pods are created faster than other chart resources they depend on.

Steps To Reproduce:

  • Install RKE2 with the cluster configuration above.
  • Join the new nodes all at once, as soon as the first server is ready.

Expected behavior:

  • All nodes start normally and become Ready.

Actual behavior:

  • The first node goes NotReady.

Additional context / logs:

  • The initial "helm-install-rke2-xxxx" pods fail to start because the Calico CNI plugin errors out with "error getting ClusterInformation": the clusterinformations.crd.projectcalico.org "default" resource does not exist yet, since it is created by the first calico-node pod. The impact is low because kube-api recreates the "helm-install-rke2-xxxx" pod (it is a Job) after a few minutes, by which point the resource exists and things work.
  • The calico-typha pod fails to start with "error=connection is unauthorized: Unauthorized": the "calico-typha" ClusterRole has not been created yet when the Typha pod starts. The impact is low because kube-api recreates the calico-typha pod (it is a Deployment) after some seconds, by which point the resource exists and things work.
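When reproducing, one way to confirm the ordering is to poll until the missing resource shows up before exercising the CNI. A minimal shell sketch; the `wait_for` helper is illustrative and not part of rke2 or Calico, and the commented `kubectl` example assumes a live cluster:

```shell
#!/bin/sh
# wait_for TIMEOUT CMD...: re-run CMD until it succeeds or TIMEOUT seconds pass.
wait_for() {
  timeout="$1"; shift
  start=$(date +%s)
  until "$@"; do
    now=$(date +%s)
    if [ $((now - start)) -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep 1
  done
}

# Example (hypothetical, against a live cluster): block until the default
# ClusterInformation exists, mirroring what the first calico-node creates.
# wait_for 120 kubectl get clusterinformations.crd.projectcalico.org default
```

This is essentially what kube-api's recreation of the Job/Deployment achieves implicitly, just made explicit for debugging.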

$ kubectl get felixconfigurations.crd.projectcalico.org default -o yaml

apiVersion: crd.projectcalico.org/v1
kind: FelixConfiguration
metadata:
 annotations:
  projectcalico.org/metadata: '{"generation":2,"creationTimestamp":"2025-02-17T17:43:45Z","labels":{"app.kubernetes.io/managed-by":"Helm"},"annotations":{"meta.helm.sh/release-name":"rke2-calico","meta.helm.sh/release-namespace":"kube-system"},"managedFields":[{"manager":"helm","operation":"Update","apiVersion":"crd.projectcalico.org/v1","time":"2025-02-17T17:43:45Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:meta.helm.sh/release-name":{},"f:meta.helm.sh/release-namespace":{}},"f:labels":{".":{},"f:app.kubernetes.io/managed-by":{}}},"f:spec":{".":{},"f:defaultEndpointToHostAction":{},"f:featureDetectOverride":{},"f:healthPort":{},"f:logSeveritySys":{},"f:wireguardEnabled":{},"f:xdpEnabled":{}}}},{"manager":"operator","operation":"Update","apiVersion":"crd.projectcalico.org/v1","time":"2025-02-17T17:43:52Z","fieldsType":"FieldsV1","fieldsV1":{"f:spec":{"f:nftablesMode":{},"f:vxlanVNI":{}}}}]}'
 creationTimestamp: "2025-02-17T17:43:45Z"
 generation: 3
 name: default
 resourceVersion: "1203"
 uid: 9243034e-2561-410b-8e29-f813a0f7d31b
spec:
 bpfConnectTimeLoadBalancing: TCP
 bpfHostNetworkedNATWithoutCTLB: Enabled
 bpfLogLevel: ""
 defaultEndpointToHostAction: Drop
 featureDetectOverride: ChecksumOffloadBroken=true
 floatingIPs: Disabled
 healthPort: 9099
 logSeverityScreen: Info
 logSeveritySys: Info
 nftablesMode: Disabled
 reportingInterval: 0s
 vxlanVNI: 4096
 wireguardEnabled: false
 xdpEnabled: true

$ kubectl get nodes   # one node is NotReady

NAME          STATUS     ROLES                       AGE   VERSION
ip .internal  Ready      control-plane,etcd,master   18h   v1.31.6+rke2r1
ip internal   Ready      <none>                      18h   v1.31.6+rke2r1
ip .internal  NotReady   control-plane,etcd,master   18h   v1.31.6+rke2r1
ip internal   Ready      control-plane,etcd,master   18h   v1.31.6+rke2r1

$ kubectl get pods -A   (from the NotReady node)

NAMESPACE      NAME                                                                  READY  STATUS             RESTARTS  AGE
calico-system  calico-node-w74p5                                                     1/1    Running            0         18h
calico-system  calico-typha-b67c75986-xd2xx                                          1/1    Terminating        0         18h
kube-system    cloud-controller-manager-ip-172-31-4-129.us-east-2.compute.internal   1/1    Running            0         18h
kube-system    etcd-ip- us-east-2.compute.internal                                   1/1    Running            0         18h
kube-system    kube-apiserver-ip .us-east-2.compute.internal                         1/1    Running            0         18h
kube-system    kube-controller-manager-ip- .us-east-2.compute.internal               1/1    Running            0         18h
kube-system    kube-proxy-ip-1 us-east-2.compute.internal                            1/1    Running            0         18h
kube-system    kube-scheduler- .us-east-2.compute.internal                           1/1    Running            0         18h
kube-system    rke2-coredns-rke2-coredns-55bdf87668-qx7p9                            0/1    Terminating        0         18h
kube-system    rke2-ingress-nginx-controller-rcl77                                   0/1    ContainerCreating  0         18h

$ kubectl logs calico-typha-89b7dd8dc-hm7rl -n calico-system

Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/profiles" error=connection is unauthorized: Unauthorized

kubelet log right at the moment the node goes NotReady:

  Warning  FailedCreatePodSandBox  33m                   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "2cd827447c8ff679686bbd9164d34b13aa58c17636b436b744e5144ec37000b7": plugin type="calico" failed (add): error getting ClusterInformation: resource does not exist: ClusterInformation(default) with error: clusterinformations.crd.projectcalico.org "default" not found

From the typha pod:

typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:24:47.322 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:24:47.322 [INFO][1] typha/rebalance.go 77: Calculated new connection limit. newLimit=10000 numNodes=4 numSyncerTypes=4 numTyphas=0 reason="error" thread="k8s-poll"
2025-02-19 14:24:47.322 [INFO][1] typha/sync_server.go 532: New target number of connections currentNum=8 newMax=10000 oldMax=400 thread="numConnsGov"
2025-02-19 14:25:19.914 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:25:19.915 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:25:51.666 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:25:51.666 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:26:23.401 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:26:23.401 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:26:53.458 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:26:53.458 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:27:24.690 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:27:24.690 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:27:56.671 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:27:56.671 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:28:29.643 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:28:29.643 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:29:01.661 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:29:01.661 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:29:33.090 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:29:33.090 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:30:03.906 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:30:03.906 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:30:36.082 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:30:36.082 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:31:08.967 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:31:08.967 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:31:39.937 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:31:39.937 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:32:12.674 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:32:12.674 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"
2025-02-19 14:32:45.218 [ERROR][1] typha/lookup.go 66: Failed to get Typha endpoint from Kubernetes error=Unauthorized
2025-02-19 14:32:45.218 [WARNING][1] typha/rebalance.go 54: Failed to get number of Typhas error=Unauthorized numTyphas=0 thread="k8s-poll"

Calico project
projectcalico/calico@d6dbb99
