[k8s.io] SchedulerPredicates [Serial] validates that NodeSelector is respected if not matching [Conformance] [Suite:openshift/conformance/serial] [Suite:k8s] 10m9s #17682

Closed
tnozicka opened this issue Dec 8, 2017 · 21 comments
Labels
kind/test-flake Categorizes issue or PR as related to test flakes. priority/P0

Comments

@tnozicka
Contributor

tnozicka commented Dec 8, 2017

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/17599/test_pull_request_origin_extended_conformance_gce/12622/

/tmp/openshift/build-rpm-release/tito/rpmbuild-origin9qJhqo/BUILD/origin-3.7.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/scheduling/predicates.go:242
Dec  8 13:15:38.571: Timed out after 10m0s waiting for stable cluster.
/tmp/openshift/build-rpm-release/tito/rpmbuild-origin9qJhqo/BUILD/origin-3.7.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/util.go:5082
@tnozicka tnozicka added kind/test-flake Categorizes issue or PR as related to test flakes. priority/P1 labels Dec 8, 2017
@tnozicka
Contributor Author

tnozicka commented Dec 8, 2017

This is blocking the queue for release-3.7.

@mfojtik
Contributor

mfojtik commented Dec 12, 2017

This is also flaking on the 3.9 branch now. For 3.7, I think we can pick #17724.

@mfojtik mfojtik assigned sjenning and unassigned mfojtik Dec 12, 2017
@mfojtik
Contributor

mfojtik commented Dec 12, 2017

@sjenning, @deads2k said this is an important feature and that we should know why we are flaking here. I made a PR that disables the test on master to unblock the queue, but PTAL before that merges in case something obvious is broken.

@aveshagarwal
Contributor

I am looking into this.

@aveshagarwal
Contributor

It seems that the test is not even running, because it timed out here (WaitForStableCluster): https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/test/e2e/scheduling/predicates.go#L310

Dec  8 15:45:34.445: Timed out after 10m0s waiting for stable cluster.

Still looking into what else is causing this to time out.
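For readers unfamiliar with this kind of gate, the behaviour described above (the test body never runs because a precondition poll times out) can be sketched as follows. This is a simplified Python illustration, not the Go implementation in the e2e framework: the function name, the `list_pending_pods` callback, and the 2-second poll interval are assumptions made for the example. Only the 10-minute timeout comes from the log line above.

```python
import time


def wait_for_stable_cluster(list_pending_pods, timeout_s=600.0, poll_s=2.0,
                            clock=time.monotonic, sleep=time.sleep):
    """Block until no pods are pending, or raise TimeoutError.

    list_pending_pods is a hypothetical callback that returns the names of
    pods that have not been scheduled yet. clock and sleep are injectable
    so the loop can be exercised without real waiting.
    """
    start = clock()
    while True:
        pending = list_pending_pods()
        if not pending:
            return  # cluster is stable, the test body may start
        if clock() - start > timeout_s:
            # A single pod that can never be scheduled is enough to end up here.
            raise TimeoutError(
                "Timed out after %.0fs waiting for stable cluster; "
                "still pending: %s" % (timeout_s, ", ".join(pending))
            )
        sleep(poll_s)
```

Under these assumptions, any pod that stays Pending for the whole window makes every test guarded by the wait fail, regardless of what the test itself checks.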

@aveshagarwal
Contributor

The issue is that the router pod is failing to schedule:

Dec  8 15:45:34.493: INFO: router-1-deploy                                              Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2017-12-08 15:21:32 +0000 UTC Unschedulable 0/4 nodes are available: 4 CheckServiceAffinity, 4 MatchNodeSelector.}]
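When triaging several runs, it can help to split the scheduler message into per-predicate failure counts. The snippet below is a minimal Python sketch for that purpose. The helper name `parse_unschedulable` is hypothetical, and the regular expressions assume the message format shown in the log line above.

```python
import re


def parse_unschedulable(message):
    """Return (available, total, reasons) parsed from a scheduler message.

    Assumes the format seen in this thread, for example:
    '0/4 nodes are available: 4 CheckServiceAffinity, 4 MatchNodeSelector.'
    """
    head = re.search(r"(\d+)/(\d+) nodes are available:\s*(.*)", message)
    if head is None:
        raise ValueError("unrecognized scheduler message: %r" % message)
    available, total, tail = int(head.group(1)), int(head.group(2)), head.group(3)
    # Each reason has the form '<count> <PredicateName>', separated by commas.
    reasons = {name: int(count) for count, name in re.findall(r"(\d+)\s+(\w+)", tail)}
    return available, total, reasons


msg = "0/4 nodes are available: 4 CheckServiceAffinity, 4 MatchNodeSelector."
available, total, reasons = parse_unschedulable(msg)
print(available, total, reasons)
```

Here all four nodes fail MatchNodeSelector, which means no node carries the labels that the pod's node selector asks for.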

@aveshagarwal
Contributor

> This is flaking also on 3.9 branch now

@mfojtik can you share a link to the 3.9 logs too? I want to check whether it is always the router pod or some other pod that is causing the timeout.

@smarterclayton
Contributor

MatchNodeSelector can only happen if either ansible or the installer broke how node labels are working.

@aveshagarwal
Contributor

@smarterclayton @stevekuznetsov @sdodson where can I check what is deploying router-1-deploy in this run? https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended_conformance_gce/12622/consoleFull Also, why don't I see the same router-1-deploy being run/deployed in recent runs of the GCE conformance tests? For example: https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended_conformance_gce/12864/consoleFull

@aveshagarwal
Contributor

@mfojtik in recent runs of the GCE conformance tests, I don't see any scheduler predicates tests being run. For example, this one: https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended_conformance_gce/12864/consoleFull

@stevekuznetsov
Contributor

stevekuznetsov commented Dec 13, 2017

@aveshagarwal I would assume that the installer is placing the router down.

@stevekuznetsov
Contributor

@sdodson may be able to help you figure that out

@sdodson
Member

sdodson commented Dec 13, 2017

> MatchNodeSelector can only happen if either ansible or the installer broke how node labels are working.

@michaelgugino can you look into this? I don't see any reason for it, but it looks like it's using the default selector of region=infra rather than the selector specified in the group vars with openshift_hosted_infra_selector: "role=infra". I believe this is only happening on release-3.7, based on all of the reported failures above.

Can someone confirm that those jobs would have used the release-3.7 branch of ansible? I can't see from the logs where openshift-ansible gets installed.
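To make the suspected mismatch concrete, here is a small Python sketch of equality-based node selector matching. It is a simplification of what the MatchNodeSelector predicate checks (plain key/value selectors only, no node affinity), and the node labels are hypothetical: the real labels on the CI nodes are not shown in this thread. The two selectors are the ones named above.

```python
def selector_matches(node_selector, node_labels):
    """Return True when every key/value pair in the selector is among the node's labels."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


# Hypothetical labels for an infra node, assuming nodes are labelled to match
# the selector from the group vars.
infra_node_labels = {"role": "infra"}

default_selector = {"region": "infra"}    # default selector mentioned above
configured_selector = {"role": "infra"}   # openshift_hosted_infra_selector

print(selector_matches(default_selector, infra_node_labels))     # False
print(selector_matches(configured_selector, infra_node_labels))  # True
```

If the router deployer pod is created with the default selector while the nodes only carry the configured label, no node can match, which would be consistent with the "4 MatchNodeSelector" count in the scheduler message.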

@sdodson
Member

sdodson commented Dec 13, 2017

/assign michaelgugino

@sdodson
Member

sdodson commented Dec 13, 2017

/assign sdodson

@aveshagarwal
Contributor

https://ci.openshift.redhat.com/jenkins/job/test_branch_origin_extended_conformance_k8s/182/consoleFull#9553647145898c58db7602c31c0eab717

Again, both scheduler predicate tests are failing for the same reason: the wait for a stable cluster times out because the router-1-deploy pod is failing to schedule:

Dec 14 12:47:18.978: INFO: router-1-deploy                                            Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2017-12-14 11:39:48 +0000 UTC Unschedulable 0/4 nodes are available: 1 NodeUnschedulable, 4 MatchNodeSelector.}]

@tnozicka
Contributor Author

tnozicka commented Dec 20, 2017

@tnozicka tnozicka reopened this Dec 20, 2017
@nak3 nak3 closed this as completed in f119b7c Dec 23, 2017