Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node IP flip flops #13645

Merged
merged 1 commit into from
Apr 10, 2017
Merged

Node IP flip flops #13645

merged 1 commit into from
Apr 10, 2017

Conversation

rajatchopra
Copy link
Contributor

when both master and node reboot: bz1438402

[test]

@rajatchopra
Copy link
Contributor Author

@openshift/networking @knobunc Please review.

Fixes this bug, where a node is rebooted alongwith the master, and node reports with a flipped status address. It causes the hostsubnet to be re-assigned.
The fix looks at all existing valid addresses and if the existing hostsubnet has a nodeIP that is among the valid ones, then no update is performed.

@pravisankar
Copy link

Looking at setNodeAddress() in kubelet_node_status.go, in case of cloud provider multiple addresses are stored in node status but in non-cloud provider case, only one address and hostname address are stored in node status. The chosen node address is dependent on host-name/node-name/ChooseHostInterface(). So we could end up in a situation where nodeIP stored in HostSubnet is not in the valid addresses we got from node status.

We do support desired nodeIP to use in openshift config file. So the bug fix will be to delete all the HostSubnets for troubled nodes, specify desired nodeIP to use in openshift config and restart openshift-node service?
Once we finish this card https://trello.com/c/sCNKKYCz/375-5-allow-segregating-cluster-traffic-from-management-traffic-sdn-functionality , user can also choose desired network interface and/or node IP.

@pecameron
Copy link
Contributor

@rajatchopra A few thoughts:
Your discussion in the comment should be in the description above.
The discussion should also be in comments in the code.
This seems to rely on current contents of map, this dependency needs to be called out in a comment.
I am running a cluster in the lab where em3 is the lab network and em1 is the cluster internal network for both master to node and pod to pod. Does this fix work in that environment? Can I put the master to node on em4 and leave the pod to pod on em2?
Needs a test.

@pravisankar
Copy link

@pecameron
Your requirement is what I'm working on right now, trello card: https://trello.com/c/sCNKKYCz/375-5-allow-segregating-cluster-traffic-from-management-traffic-sdn-functionality (master to node communication on one network interface and pod to pod on other interface). This PR tried to use same IP for the node during openshift restarts when the system has multiple interfaces/IPs on the node. I was arguing this may not fix the issue and can use nodeIP openshift config for this use case.

Copy link
Contributor

@knobunc knobunc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good, but can you make the commit heading a little better so when we git log it's clear. And some detail in the commit message wouldn't hurt.

Thanks!

@rajatchopra
Copy link
Contributor Author

rajatchopra commented Apr 7, 2017

We do support desired nodeIP to use in openshift config file. So the bug fix will be to delete all the HostSubnets for troubled nodes, specify desired nodeIP to use in openshift config and restart openshift-node service?
Once we finish this card https://trello.com/c/sCNKKYCz/375-5-allow-segregating-cluster-traffic-from-management-traffic-sdn-functionality , user can also choose desired network interface and/or node IP.

@pravisankar You are correct, but we need some way to allow existing clusters to not malfunction, and this is the easiest patch I could find short of the migration script. When the Trello card is implemented we may not need this patch, but we need a fix for the release before that feature is in.

@pecameron The Trello card is the comprehensive way of dealing with this problem as Ravi pointed out. This PR is to provide a patch fix until that card is done.

 * where when a node is rebooted alongwith the master, and node reports with a flipped status address. It causes the hostsubnet to be re-assigned.
 * the fix looks at all existing valid addresses and if the existing hostsubnet has a nodeIP that is among the valid ones, then no update is performed.
Copy link

@pravisankar pravisankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OpenShift nodeIP config is honored both in cloud and non-cloud provider cases for 3.5 or later releases but before 3.5 release, it was only honoring for non-cloud provider case.

Any cluster running with 3.5 or later release doesn't need this fix and can use nodeIP config but this code will help as a safe guard. Clusters running older releases will benefit from this fix. Back porting this fix will be more useful.
LGTM
cc: @rajatchopra @dcbw

// addNode takes the nodeName, a preferred nodeIP, the node's annotations and other valid ip addresses
// Creates or updates a HostSubnet if needed
// Returns the IP address used for hostsubnet (either the preferred or one from the otherValidAddresses) and any error
func (master *OsdnMaster) addNode(nodeName string, nodeIP string, hsAnnotations map[string]string, otherValidAddresses []kapi.NodeAddress) (string, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should probably just make this take the whole Node object at this point.

and maybe return the HostSubnet rather than just the IP?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The F5 ghost node thing prevented me from doing that.. F5 exists as a hostsubnet, but there is no Node object for it.

@knobunc
Copy link
Contributor

knobunc commented Apr 10, 2017

re-[test] last timed out waiting for copr.

@openshift-bot
Copy link
Contributor

Evaluated for origin test up to dacd766

Copy link
Contributor

@dcbw dcbw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@openshift-bot
Copy link
Contributor

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/679/) (Base Commit: 1d26c38)

@danwinship
Copy link
Contributor

[merge]

@openshift-bot
Copy link
Contributor

Evaluated for origin merge up to dacd766

@openshift-bot
Copy link
Contributor

openshift-bot commented Apr 10, 2017

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/306/) (Base Commit: e3dff57) (Image: devenv-rhel7_6134)

@openshift-bot openshift-bot merged commit 445ae00 into openshift:master Apr 10, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants