
[rke2-windows] Windows node in NotReady state after it joins the cluster #7793

Open
mdrahman-suse opened this issue Feb 20, 2025 · 12 comments

Labels: kind/bug (Something isn't working), status/release-blocker

@mdrahman-suse
Contributor

Environmental Info:
RKE2 Version:

v1.29.14-rc1 and all the latest RCs (v1.30.10, v1.31.6, v1.32.2)

Node(s) CPU architecture, OS, and Version:

Ubuntu 24.04 server and agent nodes
Windows 2019 and 2022 agent nodes

Cluster Configuration:

1 server, 1 agent and 1 Windows agent

Describe the bug:

After the Windows agent joins the cluster, it remains in the NotReady state. Observed on v1.29 after commit c3050110de27bb3463ece3117ce6fa5509d89b73 and on the latest RCs; it worked fine up until commit a25f441. This most likely started happening after the k3s pull-through.

Steps To Reproduce:

  • Install RKE2 and start the rke2 service on the Linux server and agent nodes
  • Install RKE2 on the Windows node and start the rke2 service (a minimal join config is sketched below)
  • Verify the cluster is up and all nodes are in the Ready state
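
A minimal sketch of the Windows agent join config, assuming the default C:\etc\rancher\rke2\config.yaml location (the server address and token are placeholders):

# C:\etc\rancher\rke2\config.yaml
server: https://<server-ip>:9345
token: <cluster-join-token>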

Expected behavior:

  • Expected all nodes to be in Ready state

Actual behavior:

  • Observed Windows node is in NotReady state

Additional context / logs:

Nothing notable in the server logs; the following was observed when the rke2 service is run in debug mode:

time="2025-02-20T00:07:30Z" level=debug msg="Waiting for Ready condition to be updated for Kubelet Port assignment"
time="2025-02-20T00:07:30Z" level=info msg="Server <ip>:9345@RECOVERING*->ACTIVE from successful health check"
time="2025-02-20T00:07:31Z" level=debug msg="Waiting for Ready condition to be updated for Kubelet Port assignment"
time="2025-02-20T00:07:32Z" level=debug msg="Waiting for Ready condition to be updated for Kubelet Port assignment"
time="2025-02-20T00:07:33Z" level=debug msg="Waiting for Ready condition to be updated for Kubelet Port assignment"
time="2025-02-20T00:07:34Z" level=debug msg="Waiting for Ready condition to be updated for Kubelet Port assignment"
time="2025-02-20T00:07:34Z" level=debug msg="Wrote ping"
mdrahman-suse added the kind/bug (Something isn't working) label on Feb 20, 2025
@brandond
Member

You've not included any output showing the node status. Can you provide the node YAML and/or kubectl describe output? Kubelet, containerd, and CNI logs may also be useful.
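
For example, something like the following would capture that (the node name is a placeholder for the Windows node):

kubectl get node <windows-node-name> -o yaml > node.yaml
kubectl describe node <windows-node-name> > node-describe.txt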

@mdrahman-suse
Contributor Author

Cluster status

Deployments:
Linux: https://github.com/rancher/distros-test-framework/blob/main/workloads/amd64/pod_client.yaml
Windows: https://github.com/rancher/distros-test-framework/blob/main/workloads/amd64/windows_app_deployment.yaml
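
For reference, the linked manifests can be applied directly with kubectl, e.g. (assuming they have been downloaded locally under the same file names):

kubectl apply -f pod_client.yaml
kubectl apply -f windows_app_deployment.yaml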

The Windows deployment is Pending, likely because the node is in the NotReady state.

$ k get nodes
NAME                                          STATUS     ROLES                       AGE   VERSION
ip-172-31-18-156.us-east-2.compute.internal   Ready      <none>                      64m   v1.29.14+rke2r1
ip-172-31-29-249.us-east-2.compute.internal   Ready      control-plane,etcd,master   69m   v1.29.14+rke2r1
ip-ac1f10b3                                   NotReady   <none>                      62m   v1.29.14

$ k get pods -A
NAMESPACE     NAME                                                                   READY   STATUS      RESTARTS   AGE
default       client-deployment-5846fc994f-4mjzj                                     1/1     Running     0          53m
default       client-deployment-5846fc994f-z5qhf                                     1/1     Running     0          53m
default       windows-app-deployment-6964ff4fb8-k9d7f                                0/1     Pending     0          53m
default       windows-app-deployment-6964ff4fb8-wmdj7                                0/1     Pending     0          53m
kube-system   cloud-controller-manager-ip-172-31-29-249.us-east-2.compute.internal   1/1     Running     0          69m
kube-system   etcd-ip-172-31-29-249.us-east-2.compute.internal                       1/1     Running     0          68m
kube-system   helm-install-rke2-coredns-2grnf                                        0/1     Completed   0          69m
kube-system   helm-install-rke2-flannel-k9cgv                                        0/1     Completed   0          69m
kube-system   helm-install-rke2-ingress-nginx-5c6pq                                  0/1     Completed   0          69m
kube-system   helm-install-rke2-metrics-server-4n4f8                                 0/1     Completed   0          69m
kube-system   helm-install-rke2-runtimeclasses-k7bq2                                 0/1     Completed   0          69m
kube-system   helm-install-rke2-snapshot-controller-5xc6k                            0/1     Completed   2          69m
kube-system   helm-install-rke2-snapshot-controller-crd-m7kpq                        0/1     Completed   0          69m
kube-system   kube-apiserver-ip-172-31-29-249.us-east-2.compute.internal             1/1     Running     0          69m
kube-system   kube-controller-manager-ip-172-31-29-249.us-east-2.compute.internal    1/1     Running     0          69m
kube-system   kube-flannel-ds-d42gs                                                  1/1     Running     0          69m
kube-system   kube-flannel-ds-xkxd9                                                  1/1     Running     0          64m
kube-system   kube-proxy-ip-172-31-18-156.us-east-2.compute.internal                 1/1     Running     0          64m
kube-system   kube-proxy-ip-172-31-29-249.us-east-2.compute.internal                 1/1     Running     0          69m
kube-system   kube-scheduler-ip-172-31-29-249.us-east-2.compute.internal             1/1     Running     0          69m
kube-system   rke2-coredns-rke2-coredns-58664888cf-5m7jz                             1/1     Running     0          69m
kube-system   rke2-coredns-rke2-coredns-58664888cf-6w7b4                             1/1     Running     0          64m
kube-system   rke2-coredns-rke2-coredns-autoscaler-7dfbb46d5d-5kdm9                  1/1     Running     0          69m
kube-system   rke2-ingress-nginx-controller-skzt4                                    1/1     Running     0          67m
kube-system   rke2-ingress-nginx-controller-twv86                                    1/1     Running     0          64m
kube-system   rke2-metrics-server-8599b78c6d-glnvs                                   1/1     Running     0          68m
kube-system   rke2-snapshot-controller-55d765465-s52rn                               1/1     Running     0          68m

Here are the logs

CNI pod log
$ k logs -n kube-system pod/kube-flannel-ds-d42gs
Defaulted container "kube-flannel" out of: kube-flannel, install-cni-plugins (init), install-cni (init)
I0220 04:32:19.264230       1 main.go:211] CLI flags config: {etcdEndpoints:http://127.0.0.1:4001,http://127.0.0.1:2379 etcdPrefix:/coreos.com/network etcdKeyfile: etcdCertfile: etcdCAFile: etcdUsername: etcdPassword: version:false kubeSubnetMgr:true kubeApiUrl: kubeAnnotationPrefix:flannel.alpha.coreos.com kubeConfigFile: iface:[] ifaceRegex:[] ipMasq:true ifaceCanReach: subnetFile:/run/flannel/subnet.env publicIP: publicIPv6: subnetLeaseRenewMargin:60 healthzIP:0.0.0.0 healthzPort:0 iptablesResyncSeconds:5 iptablesForwardRules:true netConfPath:/etc/kube-flannel/net-conf.json setNodeNetworkUnavailable:true}
W0220 04:32:19.264540       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0220 04:32:19.302753       1 kube.go:139] Waiting 10m0s for node controller to sync
I0220 04:32:19.307003       1 kube.go:469] Starting kube subnet manager
I0220 04:32:20.302905       1 kube.go:146] Node controller sync successful
I0220 04:32:20.302933       1 main.go:231] Created subnet manager: Kubernetes Subnet Manager - ip-172-31-29-249.us-east-2.compute.internal
I0220 04:32:20.302944       1 main.go:234] Installing signal handlers
I0220 04:32:20.303167       1 main.go:468] Found network config - Backend type: vxlan
I0220 04:32:20.322184       1 kube.go:669] List of node(ip-172-31-29-249.us-east-2.compute.internal) annotations: map[string]string{"alpha.kubernetes.io/provided-node-ip":"172.31.29.249", "etcd.rke2.cattle.io/local-snapshots-timestamp":"2025-02-20T04:31:35Z", "etcd.rke2.cattle.io/node-address":"172.31.29.249", "etcd.rke2.cattle.io/node-name":"ip-172-31-29-249.us-east-2.compute.internal-6e2aae15", "node.alpha.kubernetes.io/ttl":"0", "rke2.io/encryption-config-hash":"start-0281a0f156f7e23449af6327ee9df39cdbd73e88cca7c1e48c8d8baa6e064cfd", "rke2.io/external-ip":"13.58.37.51", "rke2.io/hostname":"ip-172-31-29-249.us-east-2.compute.internal", "rke2.io/internal-ip":"172.31.29.249", "rke2.io/node-args":"[\"server\",\"--write-kubeconfig-mode\",\"0644\",\"--tls-san\",\"fake.fqdn.value\",\"--node-name\",\"ip-172-31-29-249.us-east-2.compute.internal\",\"--cni\",\"flannel\",\"--node-external-ip\",\"13.58.37.51\",\"--node-ip\",\"172.31.29.249\",\"--node-label\",\"role-etcd=true\",\"--node-label\",\"role-control-plane=true\",\"--node-label\",\"role-worker=true\"]", "rke2.io/node-config-hash":"EV7KPYC33IJYBADMMPLZOAZSDK5FK6V4F7TSQHFE2EJSCRVDLGUA====", "rke2.io/node-env":"{\"RKE2_SELINUX\":\"true\"}", "volumes.kubernetes.io/controller-managed-attach-detach":"true"}
I0220 04:32:20.322252       1 match.go:211] Determining IP address of default interface
I0220 04:32:20.328912       1 match.go:264] Using interface with name eth0 and address 172.31.29.249
I0220 04:32:20.328966       1 match.go:286] Defaulting external address to interface address (172.31.29.249)
I0220 04:32:20.329214       1 vxlan.go:141] VXLAN config: VNI=4096 Port=4789 GBP=false Learning=false DirectRouting=false
I0220 04:32:20.354742       1 kube.go:636] List of node(ip-172-31-29-249.us-east-2.compute.internal) annotations: map[string]string{"alpha.kubernetes.io/provided-node-ip":"172.31.29.249", "etcd.rke2.cattle.io/local-snapshots-timestamp":"2025-02-20T04:31:35Z", "etcd.rke2.cattle.io/node-address":"172.31.29.249", "etcd.rke2.cattle.io/node-name":"ip-172-31-29-249.us-east-2.compute.internal-6e2aae15", "node.alpha.kubernetes.io/ttl":"0", "rke2.io/encryption-config-hash":"start-0281a0f156f7e23449af6327ee9df39cdbd73e88cca7c1e48c8d8baa6e064cfd", "rke2.io/external-ip":"13.58.37.51", "rke2.io/hostname":"ip-172-31-29-249.us-east-2.compute.internal", "rke2.io/internal-ip":"172.31.29.249", "rke2.io/node-args":"[\"server\",\"--write-kubeconfig-mode\",\"0644\",\"--tls-san\",\"fake.fqdn.value\",\"--node-name\",\"ip-172-31-29-249.us-east-2.compute.internal\",\"--cni\",\"flannel\",\"--node-external-ip\",\"13.58.37.51\",\"--node-ip\",\"172.31.29.249\",\"--node-label\",\"role-etcd=true\",\"--node-label\",\"role-control-plane=true\",\"--node-label\",\"role-worker=true\"]", "rke2.io/node-config-hash":"EV7KPYC33IJYBADMMPLZOAZSDK5FK6V4F7TSQHFE2EJSCRVDLGUA====", "rke2.io/node-env":"{\"RKE2_SELINUX\":\"true\"}", "volumes.kubernetes.io/controller-managed-attach-detach":"true"}
I0220 04:32:20.615983       1 iptables.go:51] Starting flannel in iptables mode...
W0220 04:32:20.616878       1 main.go:522] no subnet found for key: FLANNEL_NETWORK in file: /run/flannel/subnet.env
W0220 04:32:20.616893       1 main.go:522] no subnet found for key: FLANNEL_SUBNET in file: /run/flannel/subnet.env
W0220 04:32:20.616904       1 main.go:557] no subnet found for key: FLANNEL_IPV6_NETWORK in file: /run/flannel/subnet.env
W0220 04:32:20.616913       1 main.go:557] no subnet found for key: FLANNEL_IPV6_SUBNET in file: /run/flannel/subnet.env
I0220 04:32:20.616923       1 iptables.go:115] Current network or subnet (10.42.0.0/16, 10.42.0.0/24) is not equal to previous one (0.0.0.0/0, 0.0.0.0/0), trying to recycle old iptables rules
I0220 04:32:20.617717       1 kube.go:490] Creating the node lease for IPv4. This is the n.Spec.PodCIDRs: [10.42.0.0/24]
I0220 04:32:20.658099       1 iptables.go:125] Setting up masking rules
I0220 04:32:20.660489       1 iptables.go:226] Changing default FORWARD chain policy to ACCEPT
I0220 04:32:20.662332       1 main.go:412] Wrote subnet file to /run/flannel/subnet.env
I0220 04:32:20.662355       1 main.go:416] Running backend.
I0220 04:32:20.667367       1 vxlan_network.go:65] watching for new subnet leases
I0220 04:32:20.708198       1 main.go:437] Waiting for all goroutines to exit
I0220 04:32:20.711708       1 iptables.go:372] bootstrap done
I0220 04:32:20.718685       1 iptables.go:372] bootstrap done
I0220 04:37:00.842831       1 kube.go:490] Creating the node lease for IPv4. This is the n.Spec.PodCIDRs: [10.42.1.0/24]
I0220 04:37:00.843895       1 subnet.go:152] Batch elem [0] is { lease.Event{Type:0, Lease:lease.Lease{EnableIPv4:true, EnableIPv6:false, Subnet:ip.IP4Net{IP:0xa2a0100, PrefixLen:0x18}, IPv6Subnet:ip.IP6Net{IP:(*ip.IP6)(nil), PrefixLen:0x0}, Attrs:lease.LeaseAttrs{PublicIP:0xac1f129c, PublicIPv6:(*ip.IP6)(nil), BackendType:"vxlan", BackendData:json.RawMessage{0x7b, 0x22, 0x56, 0x4e, 0x49, 0x22, 0x3a, 0x34, 0x30, 0x39, 0x36, 0x2c, 0x22, 0x56, 0x74, 0x65, 0x70, 0x4d, 0x41, 0x43, 0x22, 0x3a, 0x22, 0x30, 0x32, 0x3a, 0x32, 0x37, 0x3a, 0x65, 0x38, 0x3a, 0x63, 0x61, 0x3a, 0x33, 0x34, 0x3a, 0x36, 0x65, 0x22, 0x7d}, BackendV6Data:json.RawMessage(nil)}, Expiration:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Asof:0}} }
I0220 04:37:00.844030       1 vxlan_network.go:100] Received Subnet Event with VxLan: BackendType: vxlan, PublicIP: 172.31.18.156, PublicIPv6: (nil), BackendData: {"VNI":4096,"VtepMAC":"02:27:e8:ca:34:6e"}, BackendV6Data: (nil)
I0220 04:38:21.015030       1 kube.go:490] Creating the node lease for IPv4. This is the n.Spec.PodCIDRs: [10.42.2.0/24]
I0220 04:38:21.015227       1 subnet.go:152] Batch elem [0] is { lease.Event{Type:0, Lease:lease.Lease{EnableIPv4:true, EnableIPv6:false, Subnet:ip.IP4Net{IP:0xa2a0200, PrefixLen:0x18}, IPv6Subnet:ip.IP6Net{IP:(*ip.IP6)(nil), PrefixLen:0x0}, Attrs:lease.LeaseAttrs{PublicIP:0xac1f10b3, PublicIPv6:(*ip.IP6)(nil), BackendType:"vxlan", BackendData:json.RawMessage{0x7b, 0x22, 0x56, 0x4e, 0x49, 0x22, 0x3a, 0x34, 0x30, 0x39, 0x36, 0x2c, 0x22, 0x56, 0x74, 0x65, 0x70, 0x4d, 0x41, 0x43, 0x22, 0x3a, 0x22, 0x22, 0x7d}, BackendV6Data:json.RawMessage(nil)}, Expiration:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Asof:0}} }
I0220 04:38:21.015442       1 vxlan_network.go:100] Received Subnet Event with VxLan: BackendType: vxlan, PublicIP: 172.31.16.179, PublicIPv6: (nil), BackendData: {"VNI":4096,"VtepMAC":""}, BackendV6Data: (nil)
E0220 04:38:21.015494       1 vxlan_network.go:115] error decoding subnet lease JSON: invalid MAC address
I0220 04:38:26.650460       1 kube.go:490] Creating the node lease for IPv4. This is the n.Spec.PodCIDRs: [10.42.2.0/24]
I0220 04:38:26.657651       1 subnet.go:152] Batch elem [0] is { lease.Event{Type:0, Lease:lease.Lease{EnableIPv4:true, EnableIPv6:false, Subnet:ip.IP4Net{IP:0xa2a0200, PrefixLen:0x18}, IPv6Subnet:ip.IP6Net{IP:(*ip.IP6)(nil), PrefixLen:0x0}, Attrs:lease.LeaseAttrs{PublicIP:0xac1f10b3, PublicIPv6:(*ip.IP6)(nil), BackendType:"vxlan", BackendData:json.RawMessage{0x7b, 0x22, 0x56, 0x4e, 0x49, 0x22, 0x3a, 0x34, 0x30, 0x39, 0x36, 0x2c, 0x22, 0x56, 0x74, 0x65, 0x70, 0x4d, 0x41, 0x43, 0x22, 0x3a, 0x22, 0x30, 0x30, 0x3a, 0x31, 0x35, 0x3a, 0x35, 0x64, 0x3a, 0x34, 0x31, 0x3a, 0x35, 0x30, 0x3a, 0x63, 0x35, 0x22, 0x7d}, BackendV6Data:json.RawMessage(nil)}, Expiration:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Asof:0}} }
I0220 04:38:26.671997       1 vxlan_network.go:100] Received Subnet Event with VxLan: BackendType: vxlan, PublicIP: 172.31.16.179, PublicIPv6: (nil), BackendData: {"VNI":4096,"VtepMAC":"00:15:5d:41:50:c5"}, BackendV6Data: (nil)
  • From the server node

nodes.txt
containerd.log
kubelet.log

  • From the Windows node

win-containerd.log
win-kubelet.log

@brandond
Member

brandond commented Feb 20, 2025

Is this happening with both CNIs that we support on Windows, or only flannel?

Can you also grab the containerd config.toml? I suspect something is going on with the CNI bin dir setting in the updated template.

@siprbaum

In #7771 (linked to this one), the Windows node in NotReady state was observed with the Calico CNI and RKE2 v1.31.6-rc1+rke2r1.

@manuelbuil
Contributor

manuelbuil commented Feb 20, 2025

Can you get calico's and flannel's log? They are in C:\var\lib\rancher\rke2\agent\logs\
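
For example, from a PowerShell session on the Windows node (the exact log file names here are assumptions):

Get-ChildItem C:\var\lib\rancher\rke2\agent\logs\
Get-Content C:\var\lib\rancher\rke2\agent\logs\flanneld.log -Tail 100
Get-Content C:\var\lib\rancher\rke2\agent\logs\kubelet.log -Tail 100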

@mdrahman-suse
Contributor Author

mdrahman-suse commented Feb 20, 2025

Is this happening with both CNIs that we support on Windows, or only flannel?

Can you also grab the containerd config.toml? I suspect something is going on with the CNI bin dir setting in the updated template.

It's happening with both CNIs. CC @brandond

Here are the config.toml files.

Latest RC

  • Server
$ sudo cat /var/lib/rancher/rke2/agent/etc/containerd/config.toml
# File generated by rke2. DO NOT EDIT. Use config.toml.tmpl instead.
version = 2
root = "/var/lib/rancher/rke2/agent/containerd"
state = "/run/k3s/containerd"

[plugins."io.containerd.internal.v1.opt"]
  path = "/var/lib/rancher/rke2/agent/containerd"

[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = true
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  device_ownership_from_security_context = false
  sandbox_image = "index.docker.io/rancher/mirrored-pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true




[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runhcs-wcow-process]
  runtime_type = "io.containerd.runhcs.v1"

[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/var/lib/rancher/rke2/agent/etc/containerd/certs.d"
  • Windows agent
cat C:\var\lib\rancher\rke2\agent\etc\containerd\config.toml
# File generated by . DO NOT EDIT. Use config.toml.tmpl instead.
version = 2
root = "C:\\var\\lib\\rancher\\rke2\\agent\\containerd"
state = "C:\\var\\lib\\rancher\\rke2\\agent\\containerd\\state"

[plugins."io.containerd.internal.v1.opt"]
  path = "C:\\var\\lib\\rancher\\rke2\\agent\\containerd"

[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = false
  enable_unprivileged_icmp = false
  device_ownership_from_security_context = false
  disable_cgroup = true

  sandbox_image = "index.docker.io/rancher/mirrored-pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "windows"
  disable_snapshot_annotations = true
  default_runtime_name = "runhcs-wcow-process"



[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = false

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runhcs-wcow-process]
  runtime_type = "io.containerd.runhcs.v1"

[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "C:\\var\\lib\\rancher\\rke2\\agent\\etc\\containerd\\certs.d"

Previous release

  • Server
$ sudo cat /var/lib/rancher/rke2/agent/etc/containerd/config.toml
# File generated by rke2. DO NOT EDIT. Use config.toml.tmpl instead.
version = 2

[plugins."io.containerd.internal.v1.opt"]
  path = "/var/lib/rancher/rke2/agent/containerd"
[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  device_ownership_from_security_context = false
  sandbox_image = "index.docker.io/rancher/mirrored-pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true




[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/var/lib/rancher/rke2/agent/etc/containerd/certs.d"
  • Windows agent
cat C:\var\lib\rancher\rke2\agent\etc\containerd\config.toml
# File generated by . DO NOT EDIT. Use config.toml.tmpl instead.
version = 2
root = "C:\\var\\lib\\rancher\\rke2\\agent\\containerd"
state = "C:\\var\\lib\\rancher\\rke2\\agent\\containerd\\state"
plugin_dir = ""
disabled_plugins = []
required_plugins = []
oom_score = 0

[grpc]
  address = "//./pipe/containerd-containerd"
  tcp_address = ""
  tcp_tls_cert = ""
  tcp_tls_key = ""
  uid = 0
  gid = 0
  max_recv_message_size = 16777216
  max_send_message_size = 16777216

[ttrpc]
  address = ""
  uid = 0
  gid = 0

[debug]
  address = ""
  uid = 0
  gid = 0
  level = ""

[metrics]
  address = ""
  grpc_histogram = false

[cgroup]
  path = ""

[timeouts]
  "io.containerd.timeout.shim.cleanup" = "5s"
  "io.containerd.timeout.shim.load" = "5s"
  "io.containerd.timeout.shim.shutdown" = "3s"
  "io.containerd.timeout.task.state" = "2s"

[plugins]
  [plugins."io.containerd.gc.v1.scheduler"]
    pause_threshold = 0.02
    deletion_threshold = 0
    mutation_threshold = 100
    schedule_delay = "0s"
    startup_delay = "100ms"
  [plugins."io.containerd.grpc.v1.cri"]
    disable_tcp_service = true
    stream_server_address = "127.0.0.1"
    stream_server_port = "0"
    stream_idle_timeout = "4h0m0s"
    enable_selinux = false
    selinux_category_range = 0
    sandbox_image = "index.docker.io/rancher/mirrored-pause:3.6"
    stats_collect_period = 10
    systemd_cgroup = false
    enable_tls_streaming = false
    max_container_log_line_size = 16384
    disable_cgroup = false
    disable_apparmor = false
    restrict_oom_score_adj = false
    max_concurrent_downloads = 3
    disable_proc_mount = false
    unset_seccomp_profile = ""
    tolerate_missing_hugetlb_controller = false
    disable_hugetlb_controller = false
    ignore_image_defined_volumes = false
    [plugins."io.containerd.grpc.v1.cri".containerd]
      snapshotter = "windows"
      default_runtime_name = "runhcs-wcow-process"
      no_pivot = false
      disable_snapshot_annotations = false
      discard_unpacked_layers = false
      [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
        runtime_type = ""
        runtime_engine = ""
        runtime_root = ""
        privileged_without_host_devices = false
        base_runtime_spec = ""
      [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
        runtime_type = ""
        runtime_engine = ""
        runtime_root = ""
        privileged_without_host_devices = false
        base_runtime_spec = ""
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runhcs-wcow-process]
          runtime_type = "io.containerd.runhcs.v1"
          runtime_engine = ""
          runtime_root = ""
          privileged_without_host_devices = false
          base_runtime_spec = ""
    [plugins."io.containerd.grpc.v1.cri".cni]
      bin_dir = "c:\\var\\lib\\rancher\\rke2\\bin"
      conf_dir = "c:\\var\\lib\\rancher\\rke2\\agent\\etc\\cni"
      max_conf_num = 1
      conf_template = ""
    [plugins."io.containerd.grpc.v1.cri".registry]
      config_path = "C:\\var\\lib\\rancher\\rke2\\agent\\etc\\containerd\\certs.d"




    [plugins."io.containerd.grpc.v1.cri".image_decryption]
      key_model = ""
    [plugins."io.containerd.grpc.v1.cri".x509_key_pair_streaming]
      tls_cert_file = ""
      tls_key_file = ""
  [plugins."io.containerd.internal.v1.opt"]
    path = "C:\\var\\lib\\rancher\\rke2\\agent\\containerd"
  [plugins."io.containerd.internal.v1.restart"]
    interval = "10s"
  [plugins."io.containerd.metadata.v1.bolt"]
    content_sharing_policy = "shared"
  [plugins."io.containerd.runtime.v2.task"]
    platforms = ["windows/amd64", "linux/amd64"]
  [plugins."io.containerd.service.v1.diff-service"]
    default = ["windows", "windows-lcow"]

@mdrahman-suse
Contributor Author

mdrahman-suse commented Feb 20, 2025

Can you get calico's and flannel's log? They are in C:\var\lib\rancher\rke2\agent\logs\

Here are the logs from the Windows node. CC @manuelbuil

  • Calico

calico-node.log
felix.log
kube-proxy.log
kubelet.log

  • Flannel

flanneld.log
kube-proxy.log
kubelet.log

@mdrahman-suse
Contributor Author

mdrahman-suse commented Feb 20, 2025

I see the new RC does not have the section below in the config.toml:

 [plugins."io.containerd.grpc.v1.cri".cni]
      bin_dir = "c:\\var\\lib\\rancher\\rke2\\bin"
      conf_dir = "c:\\var\\lib\\rancher\\rke2\\agent\\etc\\cni"
      max_conf_num = 1
      conf_template = ""

Could that be the issue, @brandond?
For reference, the error in containerd.log:

time="2025-02-20T04:38:09.274398500Z" level=info msg="Get image filesystem path \"C:\\\\var\\\\lib\\\\rancher\\\\rke2\\\\agent\\\\containerd\\\\io.containerd.snapshotter.v1.windows\""
time="2025-02-20T04:38:09.277492100Z" level=error msg="failed to load cni during init, please check CRI plugin status before setting up network for pods" error="cni config load failed: no network config found in C:\\Program Files\\containerd\\cni\\conf: cni plugin not initialized: failed to load cni config"

@brandond
Member

Apparently Linux RKE2 nodes use the default CNI bin path (it is unset in the config), but on Windows it is set to c:\var\lib\rancher\rke2\bin. I am not sure why this isn't consistent across both platforms, as it is on K3s - but that should be fixable.

@manuelbuil
Contributor

manuelbuil commented Feb 20, 2025

Can you get calico's and flannel's log? They are in C:\var\lib\rancher\rke2\agent\logs\

Here are the logs from the Windows node. CC @manuelbuil

  • Calico

calico-node.log felix.log kube-proxy.log kubelet.log

  • Flannel

flanneld.log kube-proxy.log kubelet.log

Just to add some information: the logs look correct, so the network infrastructure should be created properly. It is likely that the node can't find the CNI binary, as you are already discovering.

@shwethadec01

Thanks @brandond for identifying the root cause of the issue. Would you be able to share an estimated timeline for when the fix might be available?

@brandond
Copy link
Member

Before final release.
