Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hack/test-cmd.sh:107: executing 'oc login' fails with "error: Missing configuration" #17574

Closed
php-coder opened this issue Dec 4, 2017 · 14 comments
Assignees
Labels
component/cli kind/test-flake Categorizes issue or PR as related to test flakes. priority/P1

Comments

@php-coder
Copy link
Contributor

hack/test-cmd.sh has many failures, but looks like the most important is this:

github.com/openshift/origin/test/cmd/cmd-admin-namespace-setup hack/test-cmd.sh:107: executing 'oc login --server='https://172.17.0.2:28443' --certificate-authority='/tmp/openshift/init/openshift.local.config/master/ca.crt' -u test-user -p anything' expecting success 0.21s

go run hack/e2e.go -v -test --test_args='--ginkgo.focus=github\.com\/openshift\/origin\/test\/cmd\/cmd\-admin\-namespace\-setup\shack\/test\-cmd\.sh\:107\:\sexecuting\s\'oc\slogin\s\-\-server\=\'https\:\/\/172\.17\.0\.2\:28443\'\s\-\-certificate\-authority\=\'\/tmp\/openshift\/init\/openshift\.local\.config\/master\/ca\.crt\'\s\-u\stest\-user\s\-p\sanything\'\sexpecting\ssuccess$'

=== BEGIN TEST CASE ===
hack/test-cmd.sh:107: executing 'oc login --server='https://172.17.0.2:28443' --certificate-authority='/tmp/openshift/init/openshift.local.config/master/ca.crt' -u test-user -p anything' expecting success
FAILURE after 0.209s: hack/test-cmd.sh:107: executing 'oc login --server='https://172.17.0.2:28443' --certificate-authority='/tmp/openshift/init/openshift.local.config/master/ca.crt' -u test-user -p anything' expecting success: the command returned the wrong error code
There was no output from the command.
Standard error from the command:
error: Missing configuration - verify you have provided the correct host and port and that the server is currently running.
error: Missing configuration
=== END TEST CASE ===

The log is here: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/17512/test_pull_request_origin_cmd/6902/

@php-coder
Copy link
Contributor Author

This is a simple race, we need to wait unit server will be ready.

@php-coder
Copy link
Contributor Author

It 100% reproducible on master locally (just run hack/test-cmd.sh). I also was able to fix it:

-os::cmd::expect_success "oc login --server='${KUBERNETES_MASTER}' --certificate-authority='${MASTER_CONFIG_DIR}/ca.crt' -u test-user -p anything"
+os::cmd::try_until_success "oc login --server='${KUBERNETES_MASTER}' --certificate-authority='${MASTER_CONFIG_DIR}/ca.crt' -u test-user -p anything"

The question is: would such a patch will be accepted? This race also signalizes that OpenShift server became slowly. I wasn't able to quickly identify a cause unfortunately.

CC @stevekuznetsov

@php-coder
Copy link
Contributor Author

I submitted https://github.com/RangelReale/osincli/issues/6 to improve the error reporting here.

@php-coder
Copy link
Contributor Author

So far I found the following:

  1. the error "Missing configuration" is happening here:
    func (c *Client) initialize() error {
    if c.config.ClientId == "" || c.config.AuthorizeUrl == "" ||
    c.config.TokenUrl == "" || c.config.RedirectUrl == "" {
    return errors.New("Missing configuration")
    }

    I printed the values of required members and get the following output:
clientId = openshift-challenging-client, authorizeUrl = , tokenUrl = , RedirectUrl = /oauth/token/implicit

So, the caller haven't fully configured a config (authorizeUrl and tokenUrl are empty).

  1. here is the caller:

    if o.OsinConfig == nil {
    if err := o.SetDefaultOsinConfig(); err != nil {
    return "", err
    }
    }
    client, err := osincli.NewClient(o.OsinConfig)
    if err != nil {
    return "", err
    }

    During debugging I found that SetDefaultOsinConfig() is being invoked. It means that this method is offender.

  2. The SetDefaultOsinConfig() function:

    func (o *RequestTokenOptions) SetDefaultOsinConfig() error {
    if o.OsinConfig != nil {
    return fmt.Errorf("osin config is already set to: %#v", *o.OsinConfig)
    }
    // get the OAuth metadata from the server
    rt, err := restclient.TransportFor(o.ClientConfig)
    if err != nil {
    return err
    }
    resp, err := request(rt, strings.TrimRight(o.ClientConfig.Host, "/")+oauthMetadataEndpoint, nil)
    if err != nil {
    return err
    }
    defer resp.Body.Close()
    metadata := &util.OauthAuthorizationServerMetadata{}
    if err := json.NewDecoder(resp.Body).Decode(metadata); err != nil {
    return err
    }
    // use the metadata to build the osin config
    config := &osincli.ClientConfig{
    ClientId: openShiftCLIClientID,
    AuthorizeUrl: metadata.AuthorizationEndpoint,
    TokenUrl: metadata.TokenEndpoint,
    RedirectUrl: util.OpenShiftOAuthTokenImplicitURL(metadata.Issuer),
    }
    if !o.TokenFlow && sets.NewString(metadata.CodeChallengeMethodsSupported...).Has(pkce_s256) {
    if err := osincli.PopulatePKCE(config); err != nil {
    return err
    }
    }
    o.OsinConfig = config
    return nil
    }

    Here, we can see that authorizeUrl and tokenUrl are being filled from the response to the /.well-known/oauth-authorization-server resource.

I'm not familiar with this code, but by my opinion, in order to fix this issue, we should update SetDefaultOsinConfig() to either wait until we get non-empty data, or perhaps inspect the response body and re-try, or inspect the response code.

Also we could update a shell function that waits when server will be ready by also polling /.well-known/oauth-authorization-server endpoint (but this could hide other issues like we have)

I'm adding @enj to copy, because he added this code and probably could give an advice.

@php-coder
Copy link
Contributor Author

php-coder commented Dec 4, 2017

or inspect the response code

Yes, there is a problem, when we invoke request() method it returns 403 error (Forbidden) because some RBAC initialization is still in progress. Later, if we invoke it again, the data will be available. We should add check for a response code here at least.

@php-coder
Copy link
Contributor Author

request() method it returns 403 error (Forbidden) because some RBAC initialization is still in progress.

@simo5 @enj BTW do we know why initialization is slower than it was?

@simo5
Copy link
Contributor

simo5 commented Dec 4, 2017

Probably because k8s RBAc forces reconciliation on each start

@enj
Copy link
Contributor

enj commented Dec 4, 2017

@php-coder feel free to add a check to SetDefaultOsinConfig that errors out if the response code is not 200. Do not add any retry logic.

The test scripts wait for the master to report its healthy, but that does not mean the RBAC cache is ready (only that the post start hook completed).

@deads2k do we need another post start hook for such things? Or I suppose we could add further checks in hack/lib/start.sh and test/extended/alternate_launches.sh that try to guess if the RBAC bits are ready for use.

@stevekuznetsov
Copy link
Contributor

We already have to wait for the kubernetes service to be running, so adding another thing to wait for is not the end of the world ... just terrible UX. Server startup is generally an edge case though that I wouldn't expect customers to hit so realistically let's just make the test scripts more robust.

@php-coder
Copy link
Contributor Author

@enj @deads2k @simo5 When we're waiting for master, we query oc get --raw /healthz/ready. WDYT about modifying /healthz/ready so it will take into account all this RBAC reconciliation things?

@php-coder
Copy link
Contributor Author

@php-coder feel free to add a check to SetDefaultOsinConfig that errors out if the response code is not 200.

#17606 has been submitted.

@enj
Copy link
Contributor

enj commented Dec 5, 2017

When we're waiting for master, we query oc get --raw /healthz/ready. WDYT about modifying /healthz/ready so it will take into account all this RBAC reconciliation things?

@php-coder it already does that. But just because reconciliation finished does not mean that the RBAC shared informer cache that backs the authorizer is ready to process requests. Running a curl in a loop until it says 200 for /.well-known/oauth-authorization-server could work as a proxy (access to that URL is protected by RBAC but anymore is allowed to see it since its part of the discovery rule). oc get --raw is not a valid way to check because it will be running as part of the system:masters group and will bypass the RBAC authorizer (@deads2k #17414 probably made this start flaking by skipping the RBAC authorizer).

@stevekuznetsov
Copy link
Contributor

@php-coder adding a curl for /.well-known/oauth-authorization-server would be appropriate.

@bparees
Copy link
Contributor

bparees commented Dec 5, 2017

hitting this on every run of #17314

openshift-merge-robot added a commit that referenced this issue Dec 5, 2017
…-logic

Automatic merge from submit-queue.

Poll for server start using impersonation

When we poll for server start on `/healthz` and/or `/healthz/ready` with
the administrative `$KUBECONFIG` we do so as part of the
`system:masters` group and therefore the response fromt these endpoints
will succeed before the RBAC system is initialized as we bypass it.
Using impersonation here allows the poll on these endpoints to not only
wait for the server to start but also to wait for the RBAC
initialization.

Signed-off-by: Steve Kuznetsov <[email protected]>

/cc @liggitt @deads2k @enj 
Fixes #17574
openshift-merge-robot added a commit that referenced this issue Dec 6, 2017
Automatic merge from submit-queue (batch tested with PRs 17217, 17597, 17606).

Fail fast when request to /.well-known/oauth-authorization-server fails

Improves error reporting. See here for more details: #17574 (comment)

Prior this change `hack/test-cmd.sh` could fail with the error:

> error: Missing configuration - verify you have provided the correct host and port and that the server is currently running.

Now, the message has more details:

> error: couldn't get https://10.34.129.148:28443/.well-known/oauth-authorization-server: unexpected status 403 were returned - verify you have provided the correct host and port and that the server is currently running.

PTAL @enj @simo5
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/cli kind/test-flake Categorizes issue or PR as related to test flakes. priority/P1
Projects
None yet
Development

No branches or pull requests

7 participants