Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect unhandled reboots and require user intervention #22278

Merged

Conversation

mheon
Copy link
Member

@mheon mheon commented Apr 5, 2024

Podman needs to be able to detect when a system reboot occurs to do certain types of cleanup operation (for example, reset container states, clean up IPAM allocations, etc). our current method for this is a sentinel file on a tmpfs filesystem. The problem emerges that there is no directory that is guaranteed to be a tmpfs and is also guaranteed to be accessible to rootless users in the FHS. If the user has a systemd user session, we can depend on /run/user/$UID, but we can't reliably say that they do.

This code will detect the no-tmpfs-but-reboot-occurred case by writing the current system boot ID to our tmpfs sentinel file when it is created, and checking that file every time Podman starts to make sure that the current boot ID matches the cached one in the sentinel file. If they don't match, a reboot occurred and the sentinel file was not on a tmpfs and thus survived. In that case, throw an error telling the user to remove certain directories (the ones that are supposed to be tmpfs), so we can proceed as expected.

Does this PR introduce a user-facing change?

Podman now detects unhandled system reboots and advises the user on proper mitigations.

Copy link
Contributor

openshift-ci bot commented Apr 5, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mheon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mheon mheon added No New Tests Allow PR to proceed without adding regression tests and removed release-note labels Apr 5, 2024
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 5, 2024
@mheon
Copy link
Member Author

mheon commented Apr 5, 2024

Marking as No New Tests because we can't really reboot our CI VMs and we don't expose the path of the sentinel file anywhere in our public API. We'd have to run podman info --log-level=debug, parse tmpdir out of the debug output (ew), and then hard-code what the filename is supposed to be in the tests.

@Luap99
Copy link
Member

Luap99 commented Apr 5, 2024

Marking as No New Tests because we can't really reboot our CI VMs and we don't expose the path of the sentinel file anywhere in our public API. We'd have to run podman info --log-level=debug, parse tmpdir out of the debug output (ew), and then hard-code what the filename is supposed to be in the tests.

That would properly be a good case for the fedora test day test cases or so, easy enough to run mkdir "tmpdir" on persistent disk then run podman --root tmpdir/root --runroot tmpdir/runroot --tmpdir tmpdir/tmp info then have them reboot and run that again to confirm it fails with the new error.

@baude
Copy link
Member

baude commented Apr 5, 2024

LGTM

@@ -624,6 +624,25 @@ func makeRuntime(ctx context.Context, runtime *Runtime) (retErr error) {
}
}

// Check current boot ID - will be written to the alive file.
systemBootID, err := os.ReadFile("/proc/sys/kernel/random/boot_id")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this neeeds to be extracted to linux specific file as this file will not exists on freebsd

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're OK, we don't actually report errors on this failing - we just don't proceed with comparing the boot ID. We could abstract it, but the only benefit would a few less syscalls on freebsd.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah it would not result in errors but I think it would be better, we need to remember that this code must work even on freebsd. Maybe someone later decides hey this should handle the error and fail if we cannot read it...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Podman needs to be able to detect when a system reboot occurs to
do certain types of cleanup operation (for example, reset
container states, clean up IPAM allocations, etc). our current
method for this is a sentinel file on a tmpfs filesystem. The
problem emerges that there is no directory that is guaranteed to
be a tmpfs and is also guaranteed to be accessible to rootless
users in the FHS. If the user has a systemd user session, we can
depend on /run/user/$UID, but we can't reliably say that they do.

This code will detect the no-tmpfs-but-reboot-occurred case by
writing the current system boot ID to our tmpfs sentinel file
when it is created, and checking that file every time Podman
starts to make sure that the current boot ID matches the cached
one in the sentinel file. If they don't match, a reboot occurred
and the sentinel file was not on a tmpfs and thus survived. In
that case, throw an error telling the user to remove certain
directories (the ones that are supposed to be tmpfs), so we can
proceed as expected.

Signed-off-by: Matt Heon <[email protected]>
@mheon mheon force-pushed the error_on_unhandled_reboot branch from 04fb932 to 3560ccd Compare April 5, 2024 14:07
@baude
Copy link
Member

baude commented Apr 5, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 5, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 41a710b into containers:main Apr 5, 2024
94 checks passed
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Jul 5, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Jul 5, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. No New Tests Allow PR to proceed without adding regression tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants