Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sdn: garbage-collect dead containers to recover IPAM leases #11964

Merged
merged 2 commits into from
Nov 21, 2016

Conversation

dcbw
Copy link
Contributor

@dcbw dcbw commented Nov 18, 2016

@dcbw
Copy link
Contributor Author

dcbw commented Nov 18, 2016

@openshift/networking

@dcbw
Copy link
Contributor Author

dcbw commented Nov 18, 2016

Clearly, this is a hack. The real upstream fix is more like kubernetes/kubernetes#37036

return fmt.Errorf("error running network teardown script: %s", getScriptError(out))
} else if err != nil {
return err
}
}

if _, err := m.runIPAM(req.Netns, cniserver.CNI_DEL, req.ContainerId); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means we're potentially passing an invalid req.Netns to runIPAM.

It looks like CNI_NETNS is optional for CNI_DEL anyway, so how about we don't pass it here? (Likewise in the cleanup code)


// ipamGarbageCollection will release unused IP.
// openshift-sdn uses the CNI host-local IPAM plugin which stores allocated IPs
// in a file. Each file created under the IPAM directory has the format:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ugh. so is this all going to go away when we get kubernetes/kubernetes#37036? If not, then should it go into the host-local plugin instead? (Particularly so that if we rebase CNI later and they've changed things, we'll get a merge failure rather than a silent breakage.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it should all go away. No, it can't go into host-local because host-local has no idea about all the running containers and whether a lease is dead or not. It's just an allocator, it doesn't know anything about containers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are we going to track this for the kube merge? I don't see anything in https://github.com/openshift/origin/blob/master/HACKING.md that would mark it for replacement. Do we just set something on the board to remember to fix this at a certain point?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "getting the list of containers host-local knows about" part could go into host-local. But if it's going away soon I don't care much

// ipamGarbageCollection will release unused IP.
// openshift-sdn uses the CNI host-local IPAM plugin which stores allocated IPs
// in a file. Each file created under the IPAM directory has the format:
// ip/container-hash. So this routine looks for hashes that are not reported by
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Each file created under the IPAM directory has an IP address for its filename and contains a container hash". (or something)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can certainly change the wording, this is mostly just a port of the corresponding kubenet workaround for this issue. I guess it's not a huge problem to diverge from that commit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current wording is just wrong, which then makes it look like the code is wrong since it doesn't correspond to comment.

@dcbw dcbw force-pushed the sdn-fix-ipam-exhaustion branch from 80caec8 to 011f4ee Compare November 18, 2016 17:17
@dcbw
Copy link
Contributor Author

dcbw commented Nov 18, 2016

@danwinship fixed up and rebased, PTAL!

@dcbw dcbw force-pushed the sdn-fix-ipam-exhaustion branch from 011f4ee to 189326a Compare November 18, 2016 17:32
@@ -213,35 +217,41 @@ func addMacvlan(netns string) error {

// Run CNI IPAM for the container, either allocating an IP address and returning
// it (for ADD) or releasing the lease and cleaning up (for DEL)
func (m *podManager) runIPAM(netnsPath string, action cniserver.CNICommand, id string) (*cnitypes.Result, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment now out of date

}
func (m *podManager) ipamAdd(netnsPath string, id string) (*cnitypes.Result, error) {
if netnsPath == "" {
return nil, fmt.Errorf("netns requires for CNI_ADD")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requireD

const ipamDir string = "/var/lib/cni/networks/openshift-sdn"
files, err := ioutil.ReadDir(ipamDir)
if err != nil {
glog.Errorf("Failed to list files in CNI host-local IPAM store: %v", ipamDir, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as govet pointed out, there's an arg mismatch here

dcbw added 2 commits November 18, 2016 12:26
We don't actually need the netns for IPAM teardown, so don't bother
passing it.  Also ensure we can release an IPAM allocation even
when the container has already died, in case kubelet didn't call
the plugin in this case.
@dcbw dcbw force-pushed the sdn-fix-ipam-exhaustion branch from 189326a to 0d5f8c0 Compare November 18, 2016 18:27
@knobunc
Copy link
Contributor

knobunc commented Nov 18, 2016

[test]

Copy link
Contributor

@knobunc knobunc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM too. What needs to happen when we merge in the upstream version? Do we remember to change this?

@dcbw
Copy link
Contributor Author

dcbw commented Nov 18, 2016

What needs to happen when we merge in the upstream version? Do we remember to change this?

I'll file an issue to revert it once upstream lands. EDIT: that issue is #11970

@knobunc
Copy link
Contributor

knobunc commented Nov 21, 2016

[merge]

@openshift-bot
Copy link
Contributor

Evaluated for origin merge up to 0d5f8c0

@knobunc
Copy link
Contributor

knobunc commented Nov 21, 2016

[test]

@openshift-bot
Copy link
Contributor

Evaluated for origin test up to 0d5f8c0

@openshift-bot
Copy link
Contributor

openshift-bot commented Nov 21, 2016

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/11601/) (Base Commit: 3b2bbe5) (Image: devenv-rhel7_5393)

@openshift-bot
Copy link
Contributor

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/11598/) (Base Commit: 3b2bbe5)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants