
[rke2-windows] Windows node in NotReady state after it joins the cluster #7793

Open
mdrahman-suse opened this issue Feb 20, 2025 · 12 comments

Labels: kind/bug (Something isn't working), status/release-blocker

@mdrahman-suse
Contributor

Environmental Info:
RKE2 Version:

v1.29.14-rc1 and all the latest RCs (v1.30.10, v1.31.6, v1.32.2)

Node(s) CPU architecture, OS, and Version:

Ubuntu 24.04 server and agent nodes
Windows 2019 and 2022 agent nodes

Cluster Configuration:

1 server, 1 agent and 1 Windows agent

Describe the bug:

After the Windows agent joins the cluster, it remains in the NotReady state. Observed on v1.29 after commit c3050110de27bb3463ece3117ce6fa5509d89b73 and on the latest RCs; it worked fine up until commit a25f441. This most likely started happening after the k3s pull-through.

Steps To Reproduce:

  • Install RKE2 and start the rke2 service on the Linux server and agent nodes
  • Install RKE2 on the Windows node and start the rke2 service (a minimal join config is sketched below)
  • Verify the cluster is up and all nodes are in the Ready state
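
A minimal sketch of the Windows agent join config, assuming the default C:\etc\rancher\rke2\config.yaml location (the server address and token are placeholders):

# C:\etc\rancher\rke2\config.yaml
server: https://<server-ip>:9345
token: <cluster-join-token>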

Expected behavior:

  • Expected all nodes to be in Ready state

Actual behavior:

  • Observed Windows node is in NotReady state

Additional context / logs:

Nothing notable in the server logs; the following was observed when the rke2 service is run in debug mode:

time="2025-02-20T00:07:30Z" level=debug msg="Waiting for Ready condition to be updated for Kubelet Port assignment"
time="2025-02-20T00:07:30Z" level=info msg="Server <ip>:9345@RECOVERING*->ACTIVE from successful health check"
time="2025-02-20T00:07:31Z" level=debug msg="Waiting for Ready condition to be updated for Kubelet Port assignment"
time="2025-02-20T00:07:32Z" level=debug msg="Waiting for Ready condition to be updated for Kubelet Port assignment"
time="2025-02-20T00:07:33Z" level=debug msg="Waiting for Ready condition to be updated for Kubelet Port assignment"
time="2025-02-20T00:07:34Z" level=debug msg="Waiting for Ready condition to be updated for Kubelet Port assignment"
time="2025-02-20T00:07:34Z" level=debug msg="Wrote ping"
mdrahman-suse added the kind/bug (Something isn't working) label on Feb 20, 2025
@brandond
Member

You've not included any output showing the node status. Can you provide the node YAML and/or kubectl describe output? Kubelet, containerd, and CNI logs may also be useful.
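
For example, something like the following would capture that (the node name is a placeholder for the Windows node):

kubectl get node <windows-node-name> -o yaml > node.yaml
kubectl describe node <windows-node-name> > node-describe.txt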

@mdrahman-suse
Contributor Author

Cluster status

Deployments:
Linux: https://github.com/rancher/distros-test-framework/blob/main/workloads/amd64/pod_client.yaml
Windows: https://github.com/rancher/distros-test-framework/blob/main/workloads/amd64/windows_app_deployment.yaml
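
For reference, the linked manifests can be applied directly with kubectl, e.g. (assuming they have been downloaded locally under the same file names):

kubectl apply -f pod_client.yaml
kubectl apply -f windows_app_deployment.yaml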

The Windows deployment is Pending, likely because the node is in the NotReady state.

$ k get nodes
NAME                                          STATUS     ROLES                       AGE   VERSION
ip-172-31-18-156.us-east-2.compute.internal   Ready      <none>                      64m   v1.29.14+rke2r1
ip-172-31-29-249.us-east-2.compute.internal   Ready      control-plane,etcd,master   69m   v1.29.14+rke2r1
ip-ac1f10b3                                   NotReady   <none>                      62m   v1.29.14

$ k get pods -A
NAMESPACE     NAME                                                                   READY   STATUS      RESTARTS   AGE
default       client-deployment-5846fc994f-4mjzj                                     1/1     Running     0          53m
default       client-deployment-5846fc994f-z5qhf                                     1/1     Running     0          53m
default       windows-app-deployment-6964ff4fb8-k9d7f                                0/1     Pending     0          53m
default       windows-app-deployment-6964ff4fb8-wmdj7                                0/1     Pending     0          53m
kube-system   cloud-controller-manager-ip-172-31-29-249.us-east-2.compute.internal   1/1     Running     0          69m
kube-system   etcd-ip-172-31-29-249.us-east-2.compute.internal                       1/1     Running     0          68m
kube-system   helm-install-rke2-coredns-2grnf                                        0/1     Completed   0          69m
kube-system   helm-install-rke2-flannel-k9cgv                                        0/1     Completed   0          69m
kube-system   helm-install-rke2-ingress-nginx-5c6pq                                  0/1     Completed   0          69m
kube-system   helm-install-rke2-metrics-server-4n4f8                                 0/1     Completed   0          69m
kube-system   helm-install-rke2-runtimeclasses-k7bq2                                 0/1     Completed   0          69m
kube-system   helm-install-rke2-snapshot-controller-5xc6k                            0/1     Completed   2          69m
kube-system   helm-install-rke2-snapshot-controller-crd-m7kpq                        0/1     Completed   0          69m
kube-system   kube-apiserver-ip-172-31-29-249.us-east-2.compute.internal             1/1     Running     0          69m
kube-system   kube-controller-manager-ip-172-31-29-249.us-east-2.compute.internal    1/1     Running     0          69m
kube-system   kube-flannel-ds-d42gs                                                  1/1     Running     0          69m
kube-system   kube-flannel-ds-xkxd9                                                  1/1     Running     0          64m
kube-system   kube-proxy-ip-172-31-18-156.us-east-2.compute.internal                 1/1     Running     0          64m
kube-system   kube-proxy-ip-172-31-29-249.us-east-2.compute.internal                 1/1     Running     0          69m
kube-system   kube-scheduler-ip-172-31-29-249.us-east-2.compute.internal             1/1     Running     0          69m
kube-system   rke2-coredns-rke2-coredns-58664888cf-5m7jz                             1/1     Running     0          69m
kube-system   rke2-coredns-rke2-coredns-58664888cf-6w7b4                             1/1     Running     0          64m
kube-system   rke2-coredns-rke2-coredns-autoscaler-7dfbb46d5d-5kdm9                  1/1     Running     0          69m
kube-system   rke2-ingress-nginx-controller-skzt4                                    1/1     Running     0          67m
kube-system   rke2-ingress-nginx-controller-twv86                                    1/1     Running     0          64m
kube-system   rke2-metrics-server-8599b78c6d-glnvs                                   1/1     Running     0          68m
kube-system   rke2-snapshot-controller-55d765465-s52rn                               1/1     Running     0          68m

Here are the logs

CNI pod log
$ k logs -n kube-system pod/kube-flannel-ds-d42gs
Defaulted container "kube-flannel" out of: kube-flannel, install-cni-plugins (init), install-cni (init)
I0220 04:32:19.264230       1 main.go:211] CLI flags config: {etcdEndpoints:http://127.0.0.1:4001,http://127.0.0.1:2379 etcdPrefix:/coreos.com/network etcdKeyfile: etcdCertfile: etcdCAFile: etcdUsername: etcdPassword: version:false kubeSubnetMgr:true kubeApiUrl: kubeAnnotationPrefix:flannel.alpha.coreos.com kubeConfigFile: iface:[] ifaceRegex:[] ipMasq:true ifaceCanReach: subnetFile:/run/flannel/subnet.env publicIP: publicIPv6: subnetLeaseRenewMargin:60 healthzIP:0.0.0.0 healthzPort:0 iptablesResyncSeconds:5 iptablesForwardRules:true netConfPath:/etc/kube-flannel/net-conf.json setNodeNetworkUnavailable:true}
W0220 04:32:19.264540       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0220 04:32:19.302753       1 kube.go:139] Waiting 10m0s for node controller to sync
I0220 04:32:19.307003       1 kube.go:469] Starting kube subnet manager
I0220 04:32:20.302905       1 kube.go:146] Node controller sync successful
I0220 04:32:20.302933       1 main.go:231] Created subnet manager: Kubernetes Subnet Manager - ip-172-31-29-249.us-east-2.compute.internal
I0220 04:32:20.302944       1 main.go:234] Installing signal handlers
I0220 04:32:20.303167       1 main.go:468] Found network config - Backend type: vxlan
I0220 04:32:20.322184       1 kube.go:669] List of node(ip-172-31-29-249.us-east-2.compute.internal) annotations: map[string]string{"alpha.kubernetes.io/provided-node-ip":"172.31.29.249", "etcd.rke2.cattle.io/local-snapshots-timestamp":"2025-02-20T04:31:35Z", "etcd.rke2.cattle.io/node-address":"172.31.29.249", "etcd.rke2.cattle.io/node-name":"ip-172-31-29-249.us-east-2.compute.internal-6e2aae15", "node.alpha.kubernetes.io/ttl":"0", "rke2.io/encryption-config-hash":"start-0281a0f156f7e23449af6327ee9df39cdbd73e88cca7c1e48c8d8baa6e064cfd", "rke2.io/external-ip":"13.58.37.51", "rke2.io/hostname":"ip-172-31-29-249.us-east-2.compute.internal", "rke2.io/internal-ip":"172.31.29.249", "rke2.io/node-args":"[\"server\",\"--write-kubeconfig-mode\",\"0644\",\"--tls-san\",\"fake.fqdn.value\",\"--node-name\",\"ip-172-31-29-249.us-east-2.compute.internal\",\"--cni\",\"flannel\",\"--node-external-ip\",\"13.58.37.51\",\"--node-ip\",\"172.31.29.249\",\"--node-label\",\"role-etcd=true\",\"--node-label\",\"role-control-plane=true\",\"--node-label\",\"role-worker=true\"]", "rke2.io/node-config-hash":"EV7KPYC33IJYBADMMPLZOAZSDK5FK6V4F7TSQHFE2EJSCRVDLGUA====", "rke2.io/node-env":"{\"RKE2_SELINUX\":\"true\"}", "volumes.kubernetes.io/controller-managed-attach-detach":"true"}
I0220 04:32:20.322252       1 match.go:211] Determining IP address of default interface
I0220 04:32:20.328912       1 match.go:264] Using interface with name eth0 and address 172.31.29.249
I0220 04:32:20.328966       1 match.go:286] Defaulting external address to interface address (172.31.29.249)
I0220 04:32:20.329214       1 vxlan.go:141] VXLAN config: VNI=4096 Port=4789 GBP=false Learning=false DirectRouting=false
I0220 04:32:20.354742       1 kube.go:636] List of node(ip-172-31-29-249.us-east-2.compute.internal) annotations: map[string]string{"alpha.kubernetes.io/provided-node-ip":"172.31.29.249", "etcd.rke2.cattle.io/local-snapshots-timestamp":"2025-02-20T04:31:35Z", "etcd.rke2.cattle.io/node-address":"172.31.29.249", "etcd.rke2.cattle.io/node-name":"ip-172-31-29-249.us-east-2.compute.internal-6e2aae15", "node.alpha.kubernetes.io/ttl":"0", "rke2.io/encryption-config-hash":"start-0281a0f156f7e23449af6327ee9df39cdbd73e88cca7c1e48c8d8baa6e064cfd", "rke2.io/external-ip":"13.58.37.51", "rke2.io/hostname":"ip-172-31-29-249.us-east-2.compute.internal", "rke2.io/internal-ip":"172.31.29.249", "rke2.io/node-args":"[\"server\",\"--write-kubeconfig-mode\",\"0644\",\"--tls-san\",\"fake.fqdn.value\",\"--node-name\",\"ip-172-31-29-249.us-east-2.compute.internal\",\"--cni\",\"flannel\",\"--node-external-ip\",\"13.58.37.51\",\"--node-ip\",\"172.31.29.249\",\"--node-label\",\"role-etcd=true\",\"--node-label\",\"role-control-plane=true\",\"--node-label\",\"role-worker=true\"]", "rke2.io/node-config-hash":"EV7KPYC33IJYBADMMPLZOAZSDK5FK6V4F7TSQHFE2EJSCRVDLGUA====", "rke2.io/node-env":"{\"RKE2_SELINUX\":\"true\"}", "volumes.kubernetes.io/controller-managed-attach-detach":"true"}
I0220 04:32:20.615983       1 iptables.go:51] Starting flannel in iptables mode...
W0220 04:32:20.616878       1 main.go:522] no subnet found for key: FLANNEL_NETWORK in file: /run/flannel/subnet.env
W0220 04:32:20.616893       1 main.go:522] no subnet found for key: FLANNEL_SUBNET in file: /run/flannel/subnet.env
W0220 04:32:20.616904       1 main.go:557] no subnet found for key: FLANNEL_IPV6_NETWORK in file: /run/flannel/subnet.env
W0220 04:32:20.616913       1 main.go:557] no subnet found for key: FLANNEL_IPV6_SUBNET in file: /run/flannel/subnet.env
I0220 04:32:20.616923       1 iptables.go:115] Current network or subnet (10.42.0.0/16, 10.42.0.0/24) is not equal to previous one (0.0.0.0/0, 0.0.0.0/0), trying to recycle old iptables rules
I0220 04:32:20.617717       1 kube.go:490] Creating the node lease for IPv4. This is the n.Spec.PodCIDRs: [10.42.0.0/24]
I0220 04:32:20.658099       1 iptables.go:125] Setting up masking rules
I0220 04:32:20.660489       1 iptables.go:226] Changing default FORWARD chain policy to ACCEPT
I0220 04:32:20.662332       1 main.go:412] Wrote subnet file to /run/flannel/subnet.env
I0220 04:32:20.662355       1 main.go:416] Running backend.
I0220 04:32:20.667367       1 vxlan_network.go:65] watching for new subnet leases
I0220 04:32:20.708198       1 main.go:437] Waiting for all goroutines to exit
I0220 04:32:20.711708       1 iptables.go:372] bootstrap done
I0220 04:32:20.718685       1 iptables.go:372] bootstrap done
I0220 04:37:00.842831       1 kube.go:490] Creating the node lease for IPv4. This is the n.Spec.PodCIDRs: [10.42.1.0/24]
I0220 04:37:00.843895       1 subnet.go:152] Batch elem [0] is { lease.Event{Type:0, Lease:lease.Lease{EnableIPv4:true, EnableIPv6:false, Subnet:ip.IP4Net{IP:0xa2a0100, PrefixLen:0x18}, IPv6Subnet:ip.IP6Net{IP:(*ip.IP6)(nil), PrefixLen:0x0}, Attrs:lease.LeaseAttrs{PublicIP:0xac1f129c, PublicIPv6:(*ip.IP6)(nil), BackendType:"vxlan", BackendData:json.RawMessage{0x7b, 0x22, 0x56, 0x4e, 0x49, 0x22, 0x3a, 0x34, 0x30, 0x39, 0x36, 0x2c, 0x22, 0x56, 0x74, 0x65, 0x70, 0x4d, 0x41, 0x43, 0x22, 0x3a, 0x22, 0x30, 0x32, 0x3a, 0x32, 0x37, 0x3a, 0x65, 0x38, 0x3a, 0x63, 0x61, 0x3a, 0x33, 0x34, 0x3a, 0x36, 0x65, 0x22, 0x7d}, BackendV6Data:json.RawMessage(nil)}, Expiration:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Asof:0}} }
I0220 04:37:00.844030       1 vxlan_network.go:100] Received Subnet Event with VxLan: BackendType: vxlan, PublicIP: 172.31.18.156, PublicIPv6: (nil), BackendData: {"VNI":4096,"VtepMAC":"02:27:e8:ca:34:6e"}, BackendV6Data: (nil)
I0220 04:38:21.015030       1 kube.go:490] Creating the node lease for IPv4. This is the n.Spec.PodCIDRs: [10.42.2.0/24]
I0220 04:38:21.015227       1 subnet.go:152] Batch elem [0] is { lease.Event{Type:0, Lease:lease.Lease{EnableIPv4:true, EnableIPv6:false, Subnet:ip.IP4Net{IP:0xa2a0200, PrefixLen:0x18}, IPv6Subnet:ip.IP6Net{IP:(*ip.IP6)(nil), PrefixLen:0x0}, Attrs:lease.LeaseAttrs{PublicIP:0xac1f10b3, PublicIPv6:(*ip.IP6)(nil), BackendType:"vxlan", BackendData:json.RawMessage{0x7b, 0x22, 0x56, 0x4e, 0x49, 0x22, 0x3a, 0x34, 0x30, 0x39, 0x36, 0x2c, 0x22, 0x56, 0x74, 0x65, 0x70, 0x4d, 0x41, 0x43, 0x22, 0x3a, 0x22, 0x22, 0x7d}, BackendV6Data:json.RawMessage(nil)}, Expiration:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Asof:0}} }
I0220 04:38:21.015442       1 vxlan_network.go:100] Received Subnet Event with VxLan: BackendType: vxlan, PublicIP: 172.31.16.179, PublicIPv6: (nil), BackendData: {"VNI":4096,"VtepMAC":""}, BackendV6Data: (nil)
E0220 04:38:21.015494       1 vxlan_network.go:115] error decoding subnet lease JSON: invalid MAC address
I0220 04:38:26.650460       1 kube.go:490] Creating the node lease for IPv4. This is the n.Spec.PodCIDRs: [10.42.2.0/24]
I0220 04:38:26.657651       1 subnet.go:152] Batch elem [0] is { lease.Event{Type:0, Lease:lease.Lease{EnableIPv4:true, EnableIPv6:false, Subnet:ip.IP4Net{IP:0xa2a0200, PrefixLen:0x18}, IPv6Subnet:ip.IP6Net{IP:(*ip.IP6)(nil), PrefixLen:0x0}, Attrs:lease.LeaseAttrs{PublicIP:0xac1f10b3, PublicIPv6:(*ip.IP6)(nil), BackendType:"vxlan", BackendData:json.RawMessage{0x7b, 0x22, 0x56, 0x4e, 0x49, 0x22, 0x3a, 0x34, 0x30, 0x39, 0x36, 0x2c, 0x22, 0x56, 0x74, 0x65, 0x70, 0x4d, 0x41, 0x43, 0x22, 0x3a, 0x22, 0x30, 0x30, 0x3a, 0x31, 0x35, 0x3a, 0x35, 0x64, 0x3a, 0x34, 0x31, 0x3a, 0x35, 0x30, 0x3a, 0x63, 0x35, 0x22, 0x7d}, BackendV6Data:json.RawMessage(nil)}, Expiration:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Asof:0}} }
I0220 04:38:26.671997       1 vxlan_network.go:100] Received Subnet Event with VxLan: BackendType: vxlan, PublicIP: 172.31.16.179, PublicIPv6: (nil), BackendData: {"VNI":4096,"VtepMAC":"00:15:5d:41:50:c5"}, BackendV6Data: (nil)
  • From the server node

nodes.txt
containerd.log
kubelet.log

  • From the Windows node

win-containerd.log
win-kubelet.log

@brandond
Member

brandond commented Feb 20, 2025

Is this happening with both CNIs that we support on Windows, or only flannel?

Can you also grab the containerd config.toml? I suspect something is going on with the CNI bin dir setting in the updated template.

@siprbaum

In #7771 (linked to this one), the Windows node in NotReady state was observed with the Calico CNI and RKE2 v1.31.6-rc1+rke2r1.

@manuelbuil
Contributor

manuelbuil commented Feb 20, 2025

Can you get calico's and flannel's log? They are in C:\var\lib\rancher\rke2\agent\logs\
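
For example, from a PowerShell session on the Windows node (the exact log file names here are assumptions):

Get-ChildItem C:\var\lib\rancher\rke2\agent\logs\
Get-Content C:\var\lib\rancher\rke2\agent\logs\flanneld.log -Tail 100
Get-Content C:\var\lib\rancher\rke2\agent\logs\kubelet.log -Tail 100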

@mdrahman-suse
Contributor Author

mdrahman-suse commented Feb 20, 2025

Is this happening with both CNIs that we support on Windows, or only flannel?

Can you also grab the containerd config.toml? I suspect something is going on with the CNI bin dir setting in the updated template.

It's happening with both CNIs. CC @brandond

Here are the config.toml files.

Latest RC

  • Server
$ sudo cat /var/lib/rancher/rke2/agent/etc/containerd/config.toml
# File generated by rke2. DO NOT EDIT. Use config.toml.tmpl instead.
version = 2
root = "/var/lib/rancher/rke2/agent/containerd"
state = "/run/k3s/containerd"

[plugins."io.containerd.internal.v1.opt"]
  path = "/var/lib/rancher/rke2/agent/containerd"

[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = true
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  device_ownership_from_security_context = false
  sandbox_image = "index.docker.io/rancher/mirrored-pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true




[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runhcs-wcow-process]
  runtime_type = "io.containerd.runhcs.v1"

[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/var/lib/rancher/rke2/agent/etc/containerd/certs.d"
  • Windows agent
cat C:\var\lib\rancher\rke2\agent\etc\containerd\config.toml
# File generated by . DO NOT EDIT. Use config.toml.tmpl instead.
version = 2
root = "C:\\var\\lib\\rancher\\rke2\\agent\\containerd"
state = "C:\\var\\lib\\rancher\\rke2\\agent\\containerd\\state"

[plugins."io.containerd.internal.v1.opt"]
  path = "C:\\var\\lib\\rancher\\rke2\\agent\\containerd"

[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = false
  enable_unprivileged_icmp = false
  device_ownership_from_security_context = false
  disable_cgroup = true

  sandbox_image = "index.docker.io/rancher/mirrored-pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "windows"
  disable_snapshot_annotations = true
  default_runtime_name = "runhcs-wcow-process"



[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = false

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runhcs-wcow-process]
  runtime_type = "io.containerd.runhcs.v1"

[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "C:\\var\\lib\\rancher\\rke2\\agent\\etc\\containerd\\certs.d"

Previous release

  • Server
$ sudo cat /var/lib/rancher/rke2/agent/etc/containerd/config.toml
# File generated by rke2. DO NOT EDIT. Use config.toml.tmpl instead.
version = 2

[plugins."io.containerd.internal.v1.opt"]
  path = "/var/lib/rancher/rke2/agent/containerd"
[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  device_ownership_from_security_context = false
  sandbox_image = "index.docker.io/rancher/mirrored-pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true




[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/var/lib/rancher/rke2/agent/etc/containerd/certs.d"
  • Windows agent
cat C:\var\lib\rancher\rke2\agent\etc\containerd\config.toml
# File generated by . DO NOT EDIT. Use config.toml.tmpl instead.
version = 2
root = "C:\\var\\lib\\rancher\\rke2\\agent\\containerd"
state = "C:\\var\\lib\\rancher\\rke2\\agent\\containerd\\state"
plugin_dir = ""
disabled_plugins = []
required_plugins = []
oom_score = 0

[grpc]
  address = "//./pipe/containerd-containerd"
  tcp_address = ""
  tcp_tls_cert = ""
  tcp_tls_key = ""
  uid = 0
  gid = 0
  max_recv_message_size = 16777216
  max_send_message_size = 16777216

[ttrpc]
  address = ""
  uid = 0
  gid = 0

[debug]
  address = ""
  uid = 0
  gid = 0
  level = ""

[metrics]
  address = ""
  grpc_histogram = false

[cgroup]
  path = ""

[timeouts]
  "io.containerd.timeout.shim.cleanup" = "5s"
  "io.containerd.timeout.shim.load" = "5s"
  "io.containerd.timeout.shim.shutdown" = "3s"
  "io.containerd.timeout.task.state" = "2s"

[plugins]
  [plugins."io.containerd.gc.v1.scheduler"]
    pause_threshold = 0.02
    deletion_threshold = 0
    mutation_threshold = 100
    schedule_delay = "0s"
    startup_delay = "100ms"
  [plugins."io.containerd.grpc.v1.cri"]
    disable_tcp_service = true
    stream_server_address = "127.0.0.1"
    stream_server_port = "0"
    stream_idle_timeout = "4h0m0s"
    enable_selinux = false
    selinux_category_range = 0
    sandbox_image = "index.docker.io/rancher/mirrored-pause:3.6"
    stats_collect_period = 10
    systemd_cgroup = false
    enable_tls_streaming = false
    max_container_log_line_size = 16384
    disable_cgroup = false
    disable_apparmor = false
    restrict_oom_score_adj = false
    max_concurrent_downloads = 3
    disable_proc_mount = false
    unset_seccomp_profile = ""
    tolerate_missing_hugetlb_controller = false
    disable_hugetlb_controller = false
    ignore_image_defined_volumes = false
    [plugins."io.containerd.grpc.v1.cri".containerd]
      snapshotter = "windows"
      default_runtime_name = "runhcs-wcow-process"
      no_pivot = false
      disable_snapshot_annotations = false
      discard_unpacked_layers = false
      [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
        runtime_type = ""
        runtime_engine = ""
        runtime_root = ""
        privileged_without_host_devices = false
        base_runtime_spec = ""
      [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
        runtime_type = ""
        runtime_engine = ""
        runtime_root = ""
        privileged_without_host_devices = false
        base_runtime_spec = ""
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runhcs-wcow-process]
          runtime_type = "io.containerd.runhcs.v1"
          runtime_engine = ""
          runtime_root = ""
          privileged_without_host_devices = false
          base_runtime_spec = ""
    [plugins."io.containerd.grpc.v1.cri".cni]
      bin_dir = "c:\\var\\lib\\rancher\\rke2\\bin"
      conf_dir = "c:\\var\\lib\\rancher\\rke2\\agent\\etc\\cni"
      max_conf_num = 1
      conf_template = ""
    [plugins."io.containerd.grpc.v1.cri".registry]
      config_path = "C:\\var\\lib\\rancher\\rke2\\agent\\etc\\containerd\\certs.d"




    [plugins."io.containerd.grpc.v1.cri".image_decryption]
      key_model = ""
    [plugins."io.containerd.grpc.v1.cri".x509_key_pair_streaming]
      tls_cert_file = ""
      tls_key_file = ""
  [plugins."io.containerd.internal.v1.opt"]
    path = "C:\\var\\lib\\rancher\\rke2\\agent\\containerd"
  [plugins."io.containerd.internal.v1.restart"]
    interval = "10s"
  [plugins."io.containerd.metadata.v1.bolt"]
    content_sharing_policy = "shared"
  [plugins."io.containerd.runtime.v2.task"]
    platforms = ["windows/amd64", "linux/amd64"]
  [plugins."io.containerd.service.v1.diff-service"]
    default = ["windows", "windows-lcow"]

@mdrahman-suse
Contributor Author

mdrahman-suse commented Feb 20, 2025

Can you get calico's and flannel's log? They are in C:\var\lib\rancher\rke2\agent\logs\

Here are the logs from the Windows node. CC @manuelbuil

  • Calico

calico-node.log
felix.log
kube-proxy.log
kubelet.log

  • Flannel

flanneld.log
kube-proxy.log
kubelet.log

@mdrahman-suse
Contributor Author

mdrahman-suse commented Feb 20, 2025

I see the new RC does not have the section below in the config.toml:

 [plugins."io.containerd.grpc.v1.cri".cni]
      bin_dir = "c:\\var\\lib\\rancher\\rke2\\bin"
      conf_dir = "c:\\var\\lib\\rancher\\rke2\\agent\\etc\\cni"
      max_conf_num = 1
      conf_template = ""

Could that be the issue, @brandond?
For reference, the error in containerd.log:

time="2025-02-20T04:38:09.274398500Z" level=info msg="Get image filesystem path \"C:\\\\var\\\\lib\\\\rancher\\\\rke2\\\\agent\\\\containerd\\\\io.containerd.snapshotter.v1.windows\""
time="2025-02-20T04:38:09.277492100Z" level=error msg="failed to load cni during init, please check CRI plugin status before setting up network for pods" error="cni config load failed: no network config found in C:\\Program Files\\containerd\\cni\\conf: cni plugin not initialized: failed to load cni config"

@brandond
Member

Apparently Linux RKE2 nodes use the default CNI bin path (it is unset in the config), but on Windows it is set to c:\var\lib\rancher\rke2\bin. I am not sure why this isn't consistent across both platforms, as it is on K3s - but that should be fixable.

@manuelbuil
Contributor

manuelbuil commented Feb 20, 2025

Can you get calico's and flannel's log? They are in C:\var\lib\rancher\rke2\agent\logs\

Here are the logs from the Windows node. CC @manuelbuil

  • Calico

calico-node.log felix.log kube-proxy.log kubelet.log

  • Flannel

flanneld.log kube-proxy.log kubelet.log

Just to add some information: the logs look correct, so the network infrastructure should be created properly. It is likely that the node can't find the CNI binary, as you are already discovering.

@shwethadec01

Thanks @brandond for identifying the root cause of the issue. Would you be able to share an estimated timeline for when the fix might be available?

@brandond
Copy link
Member

Before final release.
