
fix mount propagation on rootfs for containerized node #13327

Merged: 1 commit into openshift:master on Mar 21, 2017

Conversation

@sjenning (Contributor) commented Mar 9, 2017

xref https://bugzilla.redhat.com/show_bug.cgi?id=1427807

For the containerized node, the host's rootfs is mounted at /rootfs in the node container. Back in the docker 1.10.3 days, the default propagation mode was rslave; however, in 1.12 it is rprivate.

This creates a problem when the node process nsenters the host mount namespace to mount volumes. If the node service is restarted (i.e. the container is stopped, removed, then started again), any volume mount points on the host are mounted rprivate under /rootfs in the node container. When any pods using the volumes are deleted, the node removes the mount point on the host, but the corresponding /rootfs mount point in the node container is not updated, because it is rprivate, and this prevents the volume from detaching with a "device is busy" error.

This PR enforces rslave on the /rootfs volume so that volume detach can complete.
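For anyone wanting to verify which mode a mount actually got, the propagation state is visible in the optional fields of /proc/self/mountinfo. A minimal sketch, where the sample line is illustrative and stands in for the real /rootfs entry on an affected host (a mount can carry both shared: and master: fields; this sketch keeps it simple):

```shell
# Propagation shows up in the optional fields (between field 6 and the
# "-" separator) of /proc/self/mountinfo: "shared:N" means shared,
# "master:N" means slave, and neither means private.
# Sample entry standing in for: grep ' /rootfs ' /proc/self/mountinfo
line='652 651 253:0 / /rootfs rw,relatime master:1 - xfs /dev/mapper/root rw'
case "$line" in
  *" shared:"*) mode=shared ;;
  *" master:"*) mode=slave ;;
  *)            mode=private ;;
esac
echo "/rootfs propagation: $mode"
```

With the fix from this PR, the /rootfs entry inside the node container should carry a master: field (slave) rather than no propagation field at all (private).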

@derekwaynecarr @eparis @rhatdan @gnufied @chao007 @wongma7

@derekwaynecarr (Member)

@csrwng -- do you do the same w/ oc cluster up?

@csrwng (Contributor) commented Mar 9, 2017

@derekwaynecarr we currently mount the volumes dir with :rslave, but do not specify a propagation mode for /rootfs. Inspecting my local RHEL machine (running docker-common-1.12.5-14.el7.x86_64), the /rootfs mount is set to rprivate propagation mode. So we have two mounts:

-v /var/lib/origin/openshift.local.volumes:/var/lib/origin/openshift.local.volumes:rslave
-v /:/rootfs (which ends up being :rprivate)

@eparis (Member) commented Mar 9, 2017

@sdodson does this need to be fixed somewhere in ansible too? or does ansible get it from here?

@csrwng (Contributor) commented Mar 9, 2017

Sounds like the same change needs to be made in the cluster up code.
https://github.com/openshift/origin/blob/master/pkg/bootstrap/docker/openshift/helper.go#L252

@rhvgoyal commented Mar 9, 2017

Shouldn't the container first exit/stop before we try to detach the volume?

@sdodson (Member) commented Mar 9, 2017

@giuseppe Also, need to check for this on system containers

@derekwaynecarr (Member)

I want to make sure we all think this is the right fix. @pmorie, can you weigh in as the resident propagation-mode expert?

@pmorie (Contributor) commented Mar 9, 2017

We should probably have specified this in the systemd unit file always instead of relying on it being the default. This change LGTM.
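A sketch of what specifying this in the unit file might look like. Everything here other than the `-v /:/rootfs:rslave` flag (the unit name, image, and surrounding options) is hypothetical, not the actual origin-node service:

```ini
# Hypothetical origin-node.service fragment; only the
# -v /:/rootfs:rslave flag reflects the change in this PR.
[Service]
ExecStart=/usr/bin/docker run --rm --name origin-node \
  --privileged --pid=host --net=host \
  -v /:/rootfs:rslave \
  openshift/node
```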

@giuseppe (Member) commented Mar 9, 2017

@sdodson thanks for tagging me. For the system container I am already using rslave for the rootfs propagation: https://github.com/openshift/origin/blob/master/images/node/system-container/config.json.template#L318

@gnufied (Member) commented Mar 9, 2017

Why do we need to mount all of "/" inside the container? Don't we just need "/proc"?

@sjenning (Contributor, Author) commented Mar 9, 2017

@gnufied it needs at least /rootfs/sys and /rootfs/var/run, which are explicitly checked.

https://github.com/openshift/origin/blob/master/pkg/cmd/server/kubernetes/node.go#L75-L85

Strangely, /rootfs/proc is not listed, but it is definitely used to enter the host mount namespace.

Since we use `--pid=host` on the docker run line, we could probably just `mount -t proc none /proc` inside the container rather than using the one from the host, but that is a different discussion.
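A sketch of that alternative. This requires a privileged container and is not meant to be run outside one; the point is only that with a shared host PID namespace, a container-local procfs still resolves to host processes:

```shell
# Inside a container started with --pid=host, mounting a fresh procfs
# yields a /proc that reflects the host PID namespace, so paths like
# /proc/1/ns/mnt used for nsenter still refer to host processes,
# without needing the bind-mounted /rootfs/proc.
mount -t proc none /proc
```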

@derekwaynecarr (Member)

[merge]

@sjenning (Contributor, Author)

Merge failed on openshift/openshift-ansible#3603. Can someone re-merge this?

@eparis (Member) commented Mar 16, 2017

[merge] because of the flake he pointed out

@sjenning (Contributor, Author) commented Mar 16, 2017

@stevekuznetsov last merge run failed at

13:44:37 TASK [validate-public : Validate the public address] ***************************
13:44:37 Thursday 16 March 2017  17:44:37 +0000 (0:00:00.379)       0:16:18.595 ******** 
13:44:37 
fatal: [localhost]: FAILED! => {"changed": false, "content": "", "failed": true, "msg": "Status code was not [200]: An unknown error occurred: ''", "redirected": false, "status": -1, "url": "https://api.prtest-5a37c28-323.origin-ci-int-gce.dev.rhcloud.com:443/healthz/ready"}

Any help? Not sure if this is a flake or would happen again if someone bumped it.

@sdodson (Member) commented Mar 16, 2017

flake openshift/origin-gce#15

@stevekuznetsov (Contributor)

Flake is openshift/origin-gce#15

@sjenning (Contributor, Author)

@sdodson @stevekuznetsov thanks. Can I get a merge again?

@eparis (Member) commented Mar 20, 2017

[merge] again

@openshift-bot (Contributor)

[Test]ing while waiting on the merge queue

@openshift-bot (Contributor)

Evaluated for origin test up to c9174c2

@openshift-bot (Contributor)

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/327/) (Base Commit: 612dcfb)

@sjenning (Contributor, Author)

flaaakes
#13183
#12544

@eparis (Member) commented Mar 20, 2017

you heard him, we [merge] after flakes!

@eparis (Member) commented Mar 21, 2017

I really should write myself a bot to auto re-tag [merge] on these things.
https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/154/

@eparis (Member) commented Mar 21, 2017

I filed #13485
[merge] while I look for the other.

@openshift-bot (Contributor)

Evaluated for origin merge up to c9174c2

@openshift-bot (Contributor)

continuous-integration/openshift-jenkins/merge FAILURE (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/167/) (Base Commit: 0343989)

@eparis (Member) commented Mar 21, 2017

I'm done with this PR.

@eparis eparis merged commit 1e2d71b into openshift:master Mar 21, 2017
@stevekuznetsov (Contributor)

We don't have a full Ansible containerized deployment test on the merge queue for this... so @sdodson, FYI, this may impact you guys?

@sdodson (Member) commented Mar 22, 2017

Yes, the bug actually gets fixed in the installer anyway:
openshift/openshift-ansible#3727 master
openshift/openshift-ansible#3728 release-1.5
openshift/openshift-ansible#3729 release-1.4

@stevekuznetsov (Contributor)

Nice
