Retrying when attempting to stream build logs #19695

adambkaplan · 2018-05-11T20:42:33Z

adambkaplan

Worth considering: a configurable timeout when --wait or --follow is set.

adambkaplan · 2018-05-11T20:43:20Z

pkg/oc/cli/cmd/startbuild.go

 var (
+	streamLogWaitDuration = 10 * time.Second


I used var rather than const here so the unit tests can override the wait duration.
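A minimal sketch of why a package-level variable helps with testing. The names `retryDelay`, `withRetryDelay`, and `waitOnce` are hypothetical and are not taken from this PR; only the idea of a `time.Duration` variable comes from the diff above.

```go
package main

import (
	"fmt"
	"time"
)

// retryDelay is a variable rather than a constant so tests can shorten it.
// Hypothetical name; the diff above calls its variable streamLogWaitDuration.
var retryDelay = 10 * time.Second

// withRetryDelay temporarily overrides retryDelay, runs fn, and restores
// the previous value before returning.
func withRetryDelay(d time.Duration, fn func()) {
	old := retryDelay
	retryDelay = d
	defer func() { retryDelay = old }()
	fn()
}

// waitOnce sleeps for the currently configured delay and reports which
// delay it used.
func waitOnce() time.Duration {
	time.Sleep(retryDelay)
	return retryDelay
}

func main() {
	withRetryDelay(time.Millisecond, func() {
		fmt.Println("delay inside test override:", waitOnce())
	})
	fmt.Println("delay afterwards:", retryDelay)
}
```

With a `const`, the override in `withRetryDelay` would not compile, so every test exercising the retry path would have to sleep for the full production delay.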

adambkaplan · 2018-05-11T20:45:24Z

pkg/oc/cli/cmd/startbuild.go

+	for i := 0; i < streamLogAttempts; i++ {
+		rd, logErr := o.BuildLogClient.Logs(build.Name, opts).Stream()
+		if logErr != nil {
+			err = ocerrors.NewError("error getting logs").WithCause(logErr)


unclear why, but the ocerrors.IsTimeoutErr check always returned false, so I removed it.

adambkaplan · 2018-05-11T20:46:25Z

pkg/oc/cli/cmd/startbuild.go

-			break
+		err = o.streamBuildLogs(newBuild)
+		if err != nil {
+			fmt.Fprintf(o.ErrOut, "Failed to stream the build logs - to view the logs, run oc logs build/%s after the build completes.\n", newBuild.Name)


add help text with work-around.

bparees · 2018-05-11

doesn't have to be after the build completes; as long as the build is in a running state we should be able to get logs from it (and optionally follow the log stream).

adambkaplan · 2018-05-11T20:50:44Z

/assign @soltysh
cc @bparees

bparees · 2018-05-11T23:12:19Z

pkg/oc/cli/cmd/startbuild.go

+		if logErr != nil {
+			err = ocerrors.NewError("error getting logs").WithCause(logErr)
+			// retry after waiting a fixed period of time
+			time.Sleep(streamLogWaitDuration)


i don't think we want to wait here. the logs api already waits 10s for the build to start running (right?), so i'd say just call it again immediately and let it do the waiting.

bparees · 2018-05-11T23:15:54Z

here's the behavior i think we want:

1. we should increase the timeout that waits for the build to start, in the logs api, 10s is too short. ideally it should respect a --request-timeout provided by the client, if that is easy enough to wire through. (oc start-build should already have that flag inherited i believe)
2. if you invoke w/ just "--follow" we should retain the existing behavior which, i believe, is that we make one call to the logs api and if it times out, we give up and fail.
3. if you invoke w/ --follow and --wait we should retry the logs api call indefinitely (because oc start-build isn't going to terminate until the build completes anyway). Barring unexpected errors from the logs api of course.

wdyt?
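A rough sketch of the proposed policy, under stated assumptions: `errLogTimeout`, `streamLogs`, and `flakyStream` are hypothetical names, and the sketch assumes a timeout from the logs API can be told apart from other errors, which an earlier comment in this thread suggests may not be straightforward in practice.

```go
package main

import (
	"errors"
	"fmt"
)

// errLogTimeout stands in for the timeout returned by the logs API when the
// build has not started running in time. Hypothetical sentinel error.
var errLogTimeout = errors.New("timed out waiting for build to start")

// streamLogs applies the proposed policy and returns the number of calls
// made together with the final error:
//   - wait == false (--follow only): one call, any error is returned.
//   - wait == true (--follow and --wait): retry on timeout until success.
//
// A non-timeout error ends the loop immediately in both modes.
func streamLogs(wait bool, stream func() error) (int, error) {
	calls := 0
	for {
		calls++
		err := stream()
		if err == nil {
			return calls, nil
		}
		if !wait || !errors.Is(err, errLogTimeout) {
			return calls, err
		}
		// Timeout with wait set: call again immediately and let the
		// server-side wait provide the delay.
	}
}

// flakyStream returns a stream function that times out the given number of
// times before succeeding.
func flakyStream(failures int) func() error {
	remaining := failures
	return func() error {
		if remaining > 0 {
			remaining--
			return errLogTimeout
		}
		return nil
	}
}

func main() {
	calls, err := streamLogs(false, flakyStream(2))
	fmt.Println("--follow only:", calls, err)

	calls, err = streamLogs(true, flakyStream(2))
	fmt.Println("--follow --wait:", calls, err)
}
```

The loop has no client-side sleep, matching the suggestion to call the logs API again immediately and rely on its own wait.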

adambkaplan · 2018-05-14T15:29:23Z

reply:

1. --request-timeout appears to be an inherited kube flag, and I believe we can take advantage of this. May need to dig a bit further to see if I can push that setting down to the REST client (perhaps making 60s the default?)
2. I think failing with the help message makes sense, regardless of error source (timeout, IO, etc.)
3. if both flags are invoked, should we print an error message every time the stream fails?

bparees · 2018-05-14T15:38:57Z

> if both flags are invoked, should we print an error message every time the stream fails?

only if some level of debug is enabled.

adambkaplan · 2018-05-15T16:28:40Z

@bparees updates from my investigation:

* --request-timeout is supported out of the box, but it applies a timeout to the underlying HTTP client. The value also appears to be passed as the timeout parameter on each REST request, though I do not know how the API server interprets that by default.
* We could add an independent timeout field to the BuildLogOptions object (ex: BuildTimeout), but this would require an API update. Worth doing this?

bparees · 2018-05-15T17:14:58Z

> We could add an independent timeout field to the BuildLogOptions object (ex: BuildTimeout), but this would require an API update. Worth doing this?

no i don't think so.

let's just bump the existing 10s timeout to 30s. Between that and adding the retry handling i think we'll cover 99% of use cases.

bparees · 2018-05-15T17:16:27Z

pkg/build/registry/buildlog/rest.go

@@ -38,7 +38,7 @@ type REST struct {
 	getSimpleLogsFn func(podNamespace, podName string, logOpts *kapi.PodLogOptions) (runtime.Object, error)
 }

-const defaultTimeout time.Duration = 10 * time.Second
+const defaultTimeout time.Duration = 60 * time.Second


let's godoc what this timeout is (it is how long we wait for the build to be running before giving up on getting logs for a build)
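A sketch of what the requested godoc could look like, together with one way such a timeout might be applied. The 30s value follows the commit message in this PR (the hunk under review shows 60s). `errNotRunning` and `waitForRunning` are hypothetical, and the polling loop is an illustration only, not how rest.go waits for the build.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// defaultTimeout is how long we wait for the build to be running before
// giving up on getting logs for the build.
const defaultTimeout time.Duration = 30 * time.Second

// errNotRunning is returned when the build did not start in time.
// Hypothetical sentinel error.
var errNotRunning = errors.New("build did not reach the running state in time")

// waitForRunning polls isRunning every interval until it reports true or
// the timeout elapses.
func waitForRunning(timeout, interval time.Duration, isRunning func() bool) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		if isRunning() {
			return nil
		}
		select {
		case <-ctx.Done():
			return errNotRunning
		case <-ticker.C:
			// Poll again on the next loop iteration.
		}
	}
}

func main() {
	checks := 0
	err := waitForRunning(defaultTimeout, 10*time.Millisecond, func() bool {
		checks++
		return checks >= 3 // pretend the build starts on the third check
	})
	fmt.Println("checks:", checks, "err:", err)
}
```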

bparees · 2018-05-15T17:28:28Z

pkg/oc/cli/cmd/startbuild.go

-			break
+		err = o.streamBuildLogs(newBuild)
+		if err != nil {
+			fmt.Fprintf(o.ErrOut, "Failed to stream the build logs - to view the logs, run oc logs build/%s\n", newBuild.Name)


should print the error too

bparees · 2018-05-15T17:32:24Z

@openshift/cli-review ptal

adambkaplan · 2018-05-15T19:27:35Z

/retest

* Add retry when attempting to stream build logs.
* Increase server-side build wait timeout to 30s.

Fixes bug 1575990

soltysh

/approve
One question, but overall lgtm. I'll leave final approval for @bparees

soltysh · 2018-05-17T08:30:04Z

pkg/oc/cli/cmd/startbuild.go

-			break
+		err = o.streamBuildLogs(newBuild)
+		if err != nil {
+			fmt.Fprintf(o.ErrOut, "Failed to stream the build logs - to view the logs, run oc logs build/%s\nError: %v\n", newBuild.Name, err)
 		}


The next if just struck me: why do we check o.Follow in the next if block if it's meant only for waiting for build completion? Following should be distinct from waiting for completion, no?

bparees · 2018-05-17

yeah good point, i think we should remove the check for o.Follow below, such that if the user only specified "--follow" and we fail to get the logs (due to timeout) the command simply exits.

that said, it's a change in behavior from the current behavior which is that even if you only specify "--follow" the command will still wait for the build to complete, even if it fails to get the logs.

But i would argue the existing behavior is a bug. If you want the command to wait, you pass --wait. If you pass --follow, you don't get wait semantics.

adambkaplan · 2018-05-17

I too agree that the existing behavior is a bug (if follow fails, the command hangs), and this will now have higher user impact because the follow error message includes the work-around instruction. I'll remove the o.Follow check.
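The behavior change agreed on here can be sketched as two predicates. The `options` struct and both function names are hypothetical; the old condition is inferred from the description in this thread rather than copied from startbuild.go.

```go
package main

import "fmt"

// options mirrors only the two flags discussed in this thread.
// Hypothetical struct; the real options type has more fields.
type options struct {
	Follow bool // --follow
	Wait   bool // --wait
}

// shouldWaitOld models the behavior described as a bug: the command waits
// for build completion when either flag is set.
func shouldWaitOld(o options) bool {
	return o.Follow || o.Wait
}

// shouldWaitNew models the agreed behavior: only --wait gives wait semantics.
func shouldWaitNew(o options) bool {
	return o.Wait
}

func main() {
	cases := []options{
		{Follow: true},
		{Wait: true},
		{Follow: true, Wait: true},
	}
	for _, o := range cases {
		fmt.Printf("follow=%t wait=%t old=%t new=%t\n",
			o.Follow, o.Wait, shouldWaitOld(o), shouldWaitNew(o))
	}
}
```

Only the `--follow` without `--wait` case differs between the two: under the new predicate the command exits after a failed log stream instead of continuing to wait for the build.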

adambkaplan · 2018-05-17T18:15:24Z

/retest

adambkaplan · 2018-05-22T15:52:04Z

@soltysh @bparees needs lgtm

bparees · 2018-05-22T17:03:13Z

/lgtm

bparees · 2018-05-22T17:03:25Z

/approve

openshift-ci-robot · 2018-05-22T17:03:30Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adambkaplan, bparees, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/build/OWNERS~~ [bparees,soltysh]
~~pkg/oc/OWNERS~~ [bparees,soltysh]
~~test/extended/OWNERS~~ [bparees]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Approved (see comments)

openshift-bot · 2018-05-22T23:51:02Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-ci-robot · 2018-05-23T07:35:59Z

@adambkaplan: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/openshift-jenkins/extended_builds | `c848ef0` | link | `/test extended_builds` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci-robot requested review from juanvallejo and soltysh May 11, 2018 20:42

openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 11, 2018

adambkaplan commented May 11, 2018

openshift-ci-robot assigned soltysh May 11, 2018

bparees reviewed May 11, 2018

bparees previously requested changes May 15, 2018

bparees self-assigned this May 15, 2018

Improve resilience of oc start-build log streaming

7a1bf39

* Add retry when attempting to stream build logs.
* Increase server-side build wait timeout to 30s.

Fixes bug 1575990

adambkaplan force-pushed the bugfix/build-log-streaming branch from f944ca6 to 7a1bf39 Compare May 15, 2018 19:33

bparees mentioned this pull request May 15, 2018

new_build timeout - can it be increased ? #19700

Closed

soltysh approved these changes May 17, 2018

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 17, 2018

oc: start-build waits only if --wait flag is set

c848ef0

openshift-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 17, 2018

openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 22, 2018

openshift-merge-robot merged commit 59ded31 into openshift:master May 23, 2018
