
[localhost] openshift_web_console : Verify that the web console is running #18569

Closed
mfojtik opened this issue Feb 12, 2018 · 14 comments
Labels: component/web, kind/test-flake, priority/P0, sig/user-interface

Comments

@mfojtik
Contributor

mfojtik commented Feb 12, 2018

See: https://openshift-gce-devel.appspot.com//build/origin-ci-test/pr-logs/pull/18524/test_pull_request_origin_extended_conformance_install/7274/

Web Console Install        : In Progress (0:06:22)
	This phase can be restarted by running: playbooks/openshift-web-console/config.yml
fatal: [localhost]: FAILED! => {
    "attempts": 60, 
    "changed": false, 
    "cmd": [
        "curl", 
        "-k", 
        "https://webconsole.openshift-web-console.svc/healthz"
    ], 
    "delta": "0:00:01.037439", 
    "end": "2018-02-12 07:30:14.341591", 
    "generated_timestamp": "2018-02-12 07:30:14.367970", 
    "msg": "non-zero return code", 
    "rc": 7, 
    "start": "2018-02-12 07:30:13.304152", 
    "stderr": [
        "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current", 
        "                                 Dload  Upload   Total   Spent    Left  Speed", 
        "", 
        "  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0", 
        "  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0curl: (7) Failed connect to webconsole.openshift-web-console.svc:443; Connection refused"
    ], 
    "stdout": []
}
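
If the cluster is still reachable when the task gives up, the same check can be reproduced by hand to see whether the failure is the console itself or simply a service with no ready endpoints. A minimal sketch, assuming oc is logged in on the master and using the service/namespace names from the URL above:

    # re-run the exact health check the playbook uses
    curl -k https://webconsole.openshift-web-console.svc/healthz

    # "connection refused" usually means the service has no ready endpoints behind it
    oc get endpoints webconsole -n openshift-web-console
    oc get pods -n openshift-web-console -o wide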

Flaked 40 times in the last 10 days, marking as P0. The controller logs for the webconsole deployment: https://gist.github.com/mfojtik/4206e4d49253dba05137d0159345eef3

mfojtik added the component/web, priority/P0, kind/test-flake, and sig/user-interface labels on Feb 12, 2018
@mfojtik
Contributor Author

mfojtik commented Feb 12, 2018

@spadgett @tnozicka according to the controller logs I can see that the replica set reached 1 replica and there is no indication of a deployment error. It would help if we could dump oc get all --all when this happens to see what state the console pods are in.

Also, dumping the webconsole container logs might help to see what is going on there.
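
A sketch of that kind of dump (assuming --all-namespaces is what --all was shorthand for, and that the console lands in the openshift-web-console namespace seen in the failure output):

    # overall cluster state
    oc get all --all-namespaces

    # deployment, replica set, and pod state for the console itself
    oc get deploy,rs,pods -n openshift-web-console -o wide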

@tnozicka
Contributor

Still looking into it, but could this new health check be failing? https://github.com/openshift/origin/pull/18411/files

@tnozicka
Contributor

I didn't see anything broken on the controller side for this in the master logs; it looks like the pod is there, just not up.

@spadgett
Member

The webconsole container is crash looping.

Error syncing pod b309476d-0fc5-11e8-9b1a-0e908b42fe1e ("webconsole-1461709481-4ntbk_openshift-web-console(b309476d-0fc5-11e8-9b1a-0e908b42fe1e)"), skipping: failed to "StartContainer" for "webconsole" with CrashLoopBackOff: "Back-off 2m40s restarting failed container=webconsole pod=webconsole-1461709481-4ntbk_openshift-web-console(b309476d-0fc5-11e8-9b1a-0e908b42fe1e)"

We need the console container logs to debug. @stevekuznetsov do you know if the console container logs are available or if that is something we can capture?
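
For reference, if the cluster is still up after the failure, the crashed container's output can be pulled manually with something like the following (the pod name is copied from the error above; --previous fetches the log of the last failed run):

    oc logs webconsole-1461709481-4ntbk -n openshift-web-console
    oc logs webconsole-1461709481-4ntbk -n openshift-web-console --previous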

Typically the container would crash loop either because:

  1. The console config is bad and fails validation on startup, or
  2. The console can't connect to the API server to determine if the template service broker is enabled.

I'm considering disabling (2) and trying to determine if the TSB is enabled by looking at the service classes in the browser.
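
Whether the TSB is actually enabled can also be checked from the CLI; a rough sketch (the resource names assume the 3.9 service catalog and API aggregation and may not exist on every cluster):

    # is a template service broker API service registered with the aggregator?
    oc get apiservices | grep -i template

    # does the service catalog already list a broker and its service classes?
    oc get clusterservicebrokers
    oc get clusterserviceclasses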

cc @deads2k

> Still looking into it, but could this new health check be failing?

I don't think it's the health check. The console was periodically flaking before that change. Would that result in CrashLoopBackOff status?

> I see this failure permanently blocking 3.7

We shouldn't be trying to install the console in the 3.7 branch :( That will always fail.

@deads2k
Contributor

deads2k commented Feb 12, 2018

> I'm considering disabling (2) and trying to determine if the TSB is enabled by looking at the service classes in the browser.

You could also update it not to fail if the check fails with a non-404. The error ought to be in your logs, though.

@spadgett
Member

I plan to

  1. Remove or relax the TSB check so it won't crash loop the pod. I'd like to do this check on the client anyway so that we catch when the TSB is enabled after the console pod starts.
  2. Add some additional logging to openshift-ansible when the console install fails, like the tail of the container log, to make these problems easier to troubleshoot.

@spadgett
Member

The five PRs to change the TSB discovery are:

openshift/origin-web-catalog#642
openshift/origin-web-console#2800
openshift/origin-web-console-server#31
#18580
openshift/openshift-ansible#7120

Without container logs it's not clear whether this is the cause of the flake, but it's a change we wanted to make anyway, and it makes things simpler.

@spadgett
Member

We won't be able to confidently know we've fixed this, or troubleshoot future problems, without the console pod logs when the install fails. @stevekuznetsov can you help with that?

/assign @stevekuznetsov

@spadgett
Member

I have an openshift-ansible PR open to log more details when the console install fails:

openshift/openshift-ansible#7132
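
One useful thing to dump alongside the container logs is the rendered console config, since a config that fails validation is one of the crash-loop causes listed above; a sketch (the webconsole-config configmap name is an assumption):

    oc get configmap webconsole-config -n openshift-web-console -o yaml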

@stevekuznetsov
Contributor

@sdodson do we want to grab logs in the installer if this fails or do we want this only for the CI?

@spadgett
Member

> do we want to grab logs in the installer if this fails or do we want this only for the CI?

I'll defer to @sdodson, but it might make sense just to add it to the installer. This would help anyone troubleshoot console install failures. For instance, if I set an incorrect openshift_web_console_prefix, it would be clear that the install failed because the image couldn't be pulled. Today I'd just see the message in the description with no hints on how to troubleshoot.
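
A bad prefix would surface as an image pull error in the namespace events even without container logs, so something like the following would already point at the cause (plain oc/kubectl, nothing installer-specific):

    oc get events -n openshift-web-console --sort-by=.lastTimestamp
    oc describe pods -n openshift-web-console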

@spadgett
Member

I don't see any instances of this in snowstorm in the last 48 hours. @mfojtik let me know if I'm looking at it correctly. Note that openshift/origin-web-console-server#31 recently merged, which could have fixed the problem.

openshift/openshift-ansible#7132 is in the merge queue and should give us more detail if it happens again.

@spadgett
Member

openshift/openshift-ansible#7108 also disables the console install for releases before 3.9.

Closing this issue.

/close
