Node IP flip flops #13645
@openshift/networking @knobunc Please review. This fixes a bug where, when a node is rebooted along with the master, the node reports a flipped status address, which causes the HostSubnet to be re-assigned.
Looking at setNodeAddress() in kubelet_node_status.go: with a cloud provider, multiple addresses are stored in the node status, but without a cloud provider only one address and the hostname address are stored. The chosen node address depends on host-name/node-name/ChooseHostInterface(). So we could end up in a situation where the nodeIP stored in the HostSubnet is not among the valid addresses we get from the node status. We do support specifying the desired nodeIP in the openshift config file. So would the bug fix be to delete all the HostSubnets for the troubled nodes, specify the desired nodeIP in the openshift config, and restart the openshift-node service?
@rajatchopra A few thoughts: |
@pecameron |
The code looks good, but can you make the commit heading a little better so when we git log it's clear. And some detail in the commit message wouldn't hurt.
Thanks!
@pravisankar You are correct, but we need some way to allow existing clusters to not malfunction, and this is the easiest patch I could find short of the migration script. When the Trello card is implemented we may not need this patch, but we need a fix for the release before that feature is in.
@pecameron The Trello card is the comprehensive way of dealing with this problem, as Ravi pointed out. This PR provides a patch fix until that card is done.
* Fixes a bug where, when a node is rebooted along with the master, the node reports a flipped status address, which causes the HostSubnet to be re-assigned.
* The fix looks at all existing valid addresses; if the existing HostSubnet has a nodeIP that is among the valid ones, no update is performed.
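As a rough illustration of the check the commit message describes, here is a minimal Go sketch. It is not the code from this PR: `NodeAddress` is a hypothetical stand-in for `kapi.NodeAddress`, and `hostSubnetIPStillValid` is an illustrative helper name. It assumes that the IP already recorded in the HostSubnet should be kept whenever it matches either the preferred nodeIP or any other address the node currently reports.

```go
package main

import "fmt"

// NodeAddress is a hypothetical stand-in for kapi.NodeAddress,
// reduced to the two fields this sketch needs.
type NodeAddress struct {
	Type    string
	Address string
}

// hostSubnetIPStillValid reports whether the IP already stored in a
// HostSubnet is still one of the node's valid addresses. When it returns
// true, the HostSubnet can be left alone instead of being re-assigned.
func hostSubnetIPStillValid(existingIP, preferredIP string, otherValid []NodeAddress) bool {
	if existingIP == "" {
		return false
	}
	if existingIP == preferredIP {
		return true
	}
	for _, addr := range otherValid {
		if addr.Address == existingIP {
			return true
		}
	}
	return false
}

func main() {
	others := []NodeAddress{
		{Type: "InternalIP", Address: "10.0.0.5"},
		{Type: "Hostname", Address: "node1"},
	}
	// The node now prefers 192.168.1.5, but the stored 10.0.0.5 is still valid.
	fmt.Println(hostSubnetIPStillValid("10.0.0.5", "192.168.1.5", others)) // true
	// 10.0.0.9 is no longer reported by the node at all.
	fmt.Println(hostSubnetIPStillValid("10.0.0.9", "192.168.1.5", others)) // false
}
```

With a check of this shape, a flip between two addresses that the node legitimately owns does not trigger a HostSubnet update; only an address the node no longer reports does.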
The OpenShift nodeIP config is honored in both the cloud-provider and non-cloud-provider cases in 3.5 and later releases; before 3.5, it was honored only in the non-cloud-provider case.
Any cluster running 3.5 or later doesn't need this fix and can use the nodeIP config, but this code will help as a safeguard. Clusters running older releases will benefit from this fix, so backporting it will be more useful.
LGTM
cc: @rajatchopra @dcbw
// addNode takes the nodeName, a preferred nodeIP, the node's annotations and other valid ip addresses
// Creates or updates a HostSubnet if needed
// Returns the IP address used for hostsubnet (either the preferred or one from the otherValidAddresses) and any error
func (master *OsdnMaster) addNode(nodeName string, nodeIP string, hsAnnotations map[string]string, otherValidAddresses []kapi.NodeAddress) (string, error) {
should probably just make this take the whole Node object at this point.
and maybe return the HostSubnet rather than just the IP?
The F5 ghost node thing prevented me from doing that. F5 exists as a HostSubnet, but there is no Node object for it.
re-[test] last timed out waiting for copr.
Evaluated for origin test up to dacd766 |
lgtm
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/679/) (Base Commit: 1d26c38) |
[merge] |
Evaluated for origin merge up to dacd766 |
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/306/) (Base Commit: e3dff57) (Image: devenv-rhel7_6134) |
when both master and node reboot: bz1438402
[test]