Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retrying when attempting to stream build logs #19695

Merged

Conversation

adambkaplan
Copy link
Contributor

Fixes bug 1575990

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 11, 2018
Copy link
Contributor Author

@adambkaplan adambkaplan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth consideration - a configurable timeout if --wait or --follow are set to true

var (
streamLogWaitDuration = 10 * time.Second
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

used var here so I could run unit tests.

for i := 0; i < streamLogAttempts; i++ {
rd, logErr := o.BuildLogClient.Logs(build.Name, opts).Stream()
if logErr != nil {
err = ocerrors.NewError("error getting logs").WithCause(logErr)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unclear why, but the ocerrors.IsTimeoutErr check always returned false, so I removed it.

break
err = o.streamBuildLogs(newBuild)
if err != nil {
fmt.Fprintf(o.ErrOut, "Failed to stream the build logs - to view the logs, run oc logs build/%s after the build completes.\n", newBuild.Name)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add help text with work-around.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't have to be after the build completes, as long as the build is in a running state we should be able to get logs from it (and optionally follow the log stream)

@adambkaplan
Copy link
Contributor Author

/assign @soltysh
cc @bparees

if logErr != nil {
err = ocerrors.NewError("error getting logs").WithCause(logErr)
// retry after waiting a fixed period of time
time.Sleep(streamLogWaitDuration)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think we want to wait here. the log api already waits 10s for the build to start running(right?), so i'd say just call it again immediately and let it do the waiting.

@bparees
Copy link
Contributor

bparees commented May 11, 2018

here's the behavior i think we want:

  1. we should increase the timeout that waits for the build to start, in the logs api, 10s is too short. ideally it should respect a --request-timeout provided by the client, if that is easy enough to wire through. (oc start-build should already have that flag inherited i believe)

  2. if you invoke w/ just "--follow" we should retain the existing behavior which, i believe, is that we make one call to the logs api and if it times out, we give up and fail.

  3. if you invoke w/ --follow and --wait we should retry the logs api call indefinitely (because oc start-build isn't going to terminate until the build completes anyway). Barring unexpected errors from the logs api of course.

wdyt?

@adambkaplan
Copy link
Contributor Author

reply:

  1. --request-timeout appears to be an inherited kube flag, and I believe we can take advantage of this. May need to dig a bit further to see if I can push that setting down to the REST client (perhaps making 60s the default?)
  2. I think failing with the help message makes sense, regardless of error source (timeout, IO, etc.)
  3. if both flags are invoked, should we print an error message every time the stream fails?

@bparees
Copy link
Contributor

bparees commented May 14, 2018

if both flags are invoked, should we print an error message every time the stream fails?

only if some level of debug is enabled.

@adambkaplan
Copy link
Contributor Author

@bparees updates from my investigation:

  1. --request-timeout is supported out of the box, but this applies a timeout to the underlying HTTP client. It does appear that this value is passed as the timeout parameter on each REST request, though I do not know how that is interpreted by default on the API server.
  2. We could add an independent timeout field to the BuildLogOptions object (ex: BuildTimeout), but this would require an API update. Worth doing this?

@bparees
Copy link
Contributor

bparees commented May 15, 2018

We could add an independent timeout field to the BuildLogOptions object (ex: BuildTimeout), but this would require an API update. Worth doing this?

no i don't think so.

let's just bump the existing 10s timeout to 30s. Between that and adding the retry handling i think we'll cover 99% of use cases.

@@ -38,7 +38,7 @@ type REST struct {
getSimpleLogsFn func(podNamespace, podName string, logOpts *kapi.PodLogOptions) (runtime.Object, error)
}

const defaultTimeout time.Duration = 10 * time.Second
const defaultTimeout time.Duration = 60 * time.Second
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's godoc what this timeout is (it is how long we wait for the build to be running before giving up on getting logs for a build)

break
err = o.streamBuildLogs(newBuild)
if err != nil {
fmt.Fprintf(o.ErrOut, "Failed to stream the build logs - to view the logs, run oc logs build/%s\n", newBuild.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should print the error too

@bparees bparees self-assigned this May 15, 2018
@bparees
Copy link
Contributor

bparees commented May 15, 2018

@openshift/cli-review ptal

@adambkaplan
Copy link
Contributor Author

/retest

* Add retry when attempting to stream build logs.
* Increase server-side build wait timeout to 30s.

Fixes bug 1575990
Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve
One question, but overall lgtm. I'll leave final approval for @bparees

break
err = o.streamBuildLogs(newBuild)
if err != nil {
fmt.Fprintf(o.ErrOut, "Failed to stream the build logs - to view the logs, run oc logs build/%s\nError: %v\n", newBuild.Name, err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The next if just struck me, why we check o.Follow in the next if block if it's meant only for waiting for build completion? Following should be distinct from waiting from completion, no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah good point, i think we should remove the check for o.Follow below, such that if the user only specified "--follow" and we fail to get the logs (due to timeout) the command simply exits.

that said, it's a change in behavior from the current behavior which is that even if you only specify "--follow" the command will still wait for the build to complete, even if it fails to get the logs.

But i would argue the existing behavior is a bug. If you want the command to wait, you pass --wait. If you pass --follow, you don't get wait semantics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I too agree that the existing behavior is a bug (if follow fails, the command hangs), and this will now have higher user impact because the follow error message includes the work-around instruction. I'll remove the o.Follow check.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 17, 2018
@openshift-ci-robot openshift-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 17, 2018
@adambkaplan
Copy link
Contributor Author

/retest

@adambkaplan
Copy link
Contributor Author

@soltysh @bparees needs lgtm

@bparees
Copy link
Contributor

bparees commented May 22, 2018

/lgtm

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 22, 2018
@bparees
Copy link
Contributor

bparees commented May 22, 2018

/approve

@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adambkaplan, bparees, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@adambkaplan adambkaplan dismissed bparees’s stale review May 22, 2018 17:50

Approved (see comments)

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 59ded31 into openshift:master May 23, 2018
@openshift-ci-robot
Copy link

@adambkaplan: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/openshift-jenkins/extended_builds c848ef0 link /test extended_builds

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants