Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent incorrect deletion of HostSubnet OVS flows #19080

Merged
merged 2 commits into from
Mar 30, 2018

Conversation

danwinship
Copy link
Contributor

@danwinship danwinship commented Mar 23, 2018

In certain circumstances we can end up with two HostSubnets having the same HostIP, which will then eventually cause nodes to delete the wrong flows:

  1. node1 and node2 register with the master, the SDN master creates HostSubnets for them, and node0 observes this and creates OVS flows for them
  2. The underlying VMs (or whatever) backing node1 and node2 are deleted. However, their Node objects are not deleted (and therefore neither are their HostSubnets).
  3. node1 and node2 are recreated. Node2 comes up first, requests an IP address from its cloud/hypervisor, and is given node1's old IP address. It registers with the master, updating node2's Node with the new IP address. (There are now two Node objects claiming the same IP. This is apparently allowed.)
  4. The SDN master sees that node2's Node's IP has changed, and changes node2's HostSubnet to match. (There are now two HostSubnet objects claiming the same IP.)
  5. node0 sees the HostSubnet update and responds by deleting its old node2 flows and creating new ones. (There are now two sets of OVS flows referring to the same node IP.)
  6. Now node1 finishes booting, gets a new IP address, and updates its Node. The SDN master updates its HostSubnet. And node0 sees the update and tries to delete its old node1 flows, and accidentally ends up deleting one of the new node2 flows instead, because of the reused IP address.

This is related to #18617 but it's not the same bug; there the HostSubnet.Subnet field is changing, which would require that the HostSubnet was actually deleted and recreated. (Which means I still don't know what's really going on in #18617.)

Anyway, the fix I went with is to add a cookie based on the HostSubnet.UID to the OVS flows, and then delete based on that rather than on the HostIP/Subnet.

I don't think this problem affects any other OVS flows; pod and service IPs should actually be enforced to be unique, so this wouldn't be able to happen. But maybe we should do UID-based cookies on everything anyway?

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1538220

@danwinship danwinship added kind/bug Categorizes issue or PR as related to a bug. component/networking sig/networking labels Mar 23, 2018
@danwinship danwinship requested review from pravisankar and dcbw March 23, 2018 18:46
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 23, 2018
@openshift-ci-robot openshift-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Mar 23, 2018
@dcbw
Copy link
Contributor

dcbw commented Mar 26, 2018

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 26, 2018
}
}
if err := hsw.networkInfo.ValidateNodeIP(hs.HostIP); err != nil {
return fmt.Errorf("Ignoring invalid subnet for node %s: %v", hs.HostIP, err)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fmt.Errorf("ignoring...

otx.DeleteFlows("table=10, tun_src=%s", subnet.HostIP)
otx.DeleteFlows("table=50, arp, nw_dst=%s", subnet.Subnet)
otx.DeleteFlows("table=90, ip, nw_dst=%s", subnet.Subnet)
otx.DeleteFlows("table=10, cookie=0x%08x", cookie)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could use both cookie and ip here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I would add a test, but that would require actually coming up with a partial sha256 collision, and if I'm gonna do that I want to end up with some bitcoins at the end!

@pravisankar pravisankar added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 26, 2018
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Mar 27, 2018
Copy link

@pravisankar pravisankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@pravisankar
Copy link

/test gcp

@danwinship danwinship removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 28, 2018
@danwinship
Copy link
Contributor Author

/retest

1 similar comment
@danwinship
Copy link
Contributor Author

/retest

@pravisankar
Copy link

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 29, 2018
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship, dcbw, pravisankar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pravisankar
Copy link

/retest

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@pravisankar
Copy link

/retest

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@smarterclayton
Copy link
Contributor

If this is severe enough please cherry-pick to origin 3.9 once merged.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 2884ee8 into openshift:master Mar 30, 2018
@danwinship danwinship deleted the subnet-tracking branch April 7, 2018 14:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. component/networking kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. sig/networking size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants