RestartPolicy doesn't make sense for static pods #130288

tallclair · 2025-02-19T20:34:31Z

/kind bug

Static pods should only ever have a restart policy of `Always`. Anything else doesn't make sense, since the kubelet doesn't track the pod status in a persistent way.

I don't think we can fail validation, for backwards-compatibility reasons, but maybe we can just unconditionally overwrite the restart policy when static pods are parsed.

/sig node
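
To make the "unconditionally overwrite" idea concrete, here is a minimal sketch in Go. It is not the kubelet's actual parsing code: `Pod`, `PodSpec`, and `forceRestartPolicyAlways` are hypothetical stand-ins defined only for this example, and the only thing taken from Kubernetes is the set of restart policy values (`Always`, `OnFailure`, `Never`).

```go
package main

import "fmt"

// RestartPolicy mirrors the string values of the Kubernetes restartPolicy field.
type RestartPolicy string

const (
	RestartPolicyAlways    RestartPolicy = "Always"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
	RestartPolicyNever     RestartPolicy = "Never"
)

// PodSpec and Pod are hypothetical, trimmed-down stand-ins for the real API types.
type PodSpec struct {
	RestartPolicy RestartPolicy
}

type Pod struct {
	Name string
	Spec PodSpec
}

// forceRestartPolicyAlways is a hypothetical helper that would run right after a
// static pod manifest is decoded. It overwrites whatever policy the manifest
// declared (including an empty one) and reports whether it changed anything,
// so the caller can log the override instead of failing validation.
func forceRestartPolicyAlways(pod *Pod) bool {
	if pod.Spec.RestartPolicy == RestartPolicyAlways {
		return false
	}
	pod.Spec.RestartPolicy = RestartPolicyAlways
	return true
}

func main() {
	pod := &Pod{Name: "etcd", Spec: PodSpec{RestartPolicy: RestartPolicyNever}}
	if forceRestartPolicyAlways(pod) {
		fmt.Printf("static pod %q: restartPolicy overridden to %s\n", pod.Name, pod.Spec.RestartPolicy)
	}
}
```

Returning a "changed" flag rather than an error keeps existing manifests loadable, which is the backwards-compatibility constraint mentioned above.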

tallclair · 2025-02-19T20:34:59Z

/priority important-longterm
/triage accepted
/cc @yujuhong @SergeyKanzhelev

tallclair · 2025-02-19T20:35:16Z

Old discussion: #34003

yujuhong · 2025-02-19T21:05:36Z

Since we're talking about static pods, there's also an open issue to add more validation: #103587

ajaysundark · 2025-02-21T00:51:46Z

I can help take a look at this. Do we have consensus on setting `Always` as the default restartPolicy for static pods?

GunaKKIBM · 2025-02-21T05:44:18Z

/assign

sftim · 2025-02-21T17:40:07Z

I'm not sure about this.

Logically, if a container fails and that would trigger a non-static Pod to terminate, I think the kubelet should delete the whole Pod and immediately make a new one (including init container execution, reinitialization of sidecars, etc). The static pod declares desired state and we should honor it; in this case, without modification.

(IMO) the kubelet doesn't need to do an API server write to the mirror Pod before tearing down the Pod sandbox and making a fresh one, but make a fresh sandbox it should. For example, making that new sandbox could even trigger creation of a new microVM.

Handling failure in this way could also help a static Pod recover from a partial node failure, such as a CPU going offline when the container runtime is backed by a partitioning hypervisor. Unlikely, but there's nothing in our conformance testing to say "don't use a partitioning hypervisor" or indeed "don't keep running after a partial hardware failure".

Why wouldn't we do the Pod sandbox replacement as I've outlined?
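
As a rough sketch of the alternative outlined above (not existing kubelet behavior), the decision could look like this in Go. `Action` and `onContainerFailure` are hypothetical names invented for the example, and the sketch only covers the case where a container has failed; it assumes that a failed container under `Never` is what would terminate a non-static Pod.

```go
package main

import "fmt"

// RestartPolicy mirrors the string values of the Kubernetes restartPolicy field.
type RestartPolicy string

const (
	RestartPolicyAlways    RestartPolicy = "Always"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
	RestartPolicyNever     RestartPolicy = "Never"
)

// Action is a hypothetical description of what the kubelet would do next.
type Action string

const (
	// RestartContainer restarts the failed container inside the existing sandbox.
	RestartContainer Action = "restart-container"
	// RecreateSandbox tears down the Pod sandbox and builds a fresh one,
	// rerunning init containers and reinitializing sidecars.
	RecreateSandbox Action = "recreate-sandbox"
)

// onContainerFailure is a hypothetical decision function for a static pod whose
// container has exited with a failure. The declared policy is honored as written:
// where a non-static Pod would terminate (Never), the whole sandbox is replaced;
// otherwise the container is restarted in place.
func onContainerFailure(policy RestartPolicy) Action {
	if policy == RestartPolicyNever {
		return RecreateSandbox
	}
	return RestartContainer
}

func main() {
	for _, p := range []RestartPolicy{RestartPolicyAlways, RestartPolicyOnFailure, RestartPolicyNever} {
		fmt.Printf("%s -> %s\n", p, onContainerFailure(p))
	}
}
```

Under this sketch the manifest is left unmodified, and the difference between policies shows up only in how much of the Pod gets rebuilt after a failure.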

k8s-ci-robot added the kind/bug and sig/node labels Feb 19, 2025

github-project-automation bot added this to SIG Node Bugs Feb 19, 2025

github-project-automation bot moved this to Triage in SIG Node Bugs Feb 19, 2025

k8s-ci-robot added the needs-triage label Feb 19, 2025

tallclair mentioned this issue Feb 19, 2025

e2e: refactor FilterNonRestartablePods function #127071

Open

k8s-ci-robot assigned GunaKKIBM Feb 21, 2025

ajaysundark linked a pull request Feb 21, 2025 that will close this issue

set restartpolicy default as always for static pods #130334

Open
