
[Fail] [k8s.io] Pods [It] should support remote command execution over websockets [Suite:openshift/conformance/parallel] [Suite:k8s] #18726

Closed
gabemontero opened this issue Feb 22, 2018 · 21 comments · Fixed by #18755


@gabemontero
Contributor

See https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended_conformance_install/7855/consoleFull#-169259343956c60d7be4b02b88ae8c268b

The failed test highlighted that Feb 22 20:21:07.165: Got message from server that didn't start with channel 1 (STDOUT): [2]

The pod for the test was in fact running.

I saw no errors in the node log for the pod prior to the namespace getting torn down.

@gabemontero gabemontero added priority/P1 kind/test-flake Categorizes issue or PR as related to test flakes. labels Feb 22, 2018
@gabemontero
Contributor Author

@openshift/sig-kube-origin

@gabemontero
Contributor Author

@sjenning fyi

@sosiouxme
Member

@deads2k
Contributor

deads2k commented Feb 23, 2018

Is this flaking or just straight out failing?

@deads2k
Contributor

deads2k commented Feb 23, 2018

No successful runs in the last 18 hours. Bumping to p0.

@deads2k
Contributor

deads2k commented Feb 23, 2018

The commit suggested by https://deck-ci.svc.ci.openshift.org/?job=test_branch_origin_extended_conformance_install doesn't make any sense. It must be a side-effect of something else.

@Kargakis @stevekuznetsov @derekwaynecarr there was a docker something in our ami, right?

@runcom
Member

runcom commented Feb 23, 2018

@gabemontero gabemontero removed the kind/test-flake Categorizes issue or PR as related to test flakes. label Feb 23, 2018
@gabemontero gabemontero changed the title from "ext test flake [Fail] [k8s.io] Pods [It] should support remote command execution over websockets [Suite:openshift/conformance/parallel] [Suite:k8s]" to "[Fail] [k8s.io] Pods [It] should support remote command execution over websockets [Suite:openshift/conformance/parallel] [Suite:k8s]" Feb 23, 2018
@gabemontero
Contributor Author

removed all "flake" indicators from this issue

@runcom
Member

runcom commented Feb 23, 2018

@sjenning @deads2k @gabemontero could you guys assist us with narrowing down the issue to a docker issue? Not sure what in origin triggers that error.

@stevekuznetsov
Contributor

Reverted the 1.13 AMI

@deads2k
Contributor

deads2k commented Feb 23, 2018

@sjenning
Contributor

I'm seeing this in the node log

Feb 23 14:29:07 ip-172-18-1-150.ec2.internal origin-node[26740]: I0223 14:29:07.964266   26740 server.go:800] GET /exec/e2e-tests-pods-wn695/pod-exec-websocket-e22e0362-18a5-11e8-beb9-0ea3dacd3950/main?command=cat&command=%2Fetc%2Fresolv.conf&error=1&output=1: (35.208338ms) 302 [[Go-http-client/1.1] 172.18.1.150:50886]

Seems that we are getting a 302 redirection from the endpoint.

@sjenning
Contributor

sjenning commented Feb 23, 2018

I see that 302 on both versions of docker in my local cluster so that isn't it.

I have near zero knowledge about websockets, SPDY, HTTP/2, etc., so this could take me a while to figure out. Anyone who knows more is welcome to figure it out faster. @smarterclayton ?

However, it seems that the test is reading the STDERR stream with stream id 2 rather than the STDOUT stream with stream id 1, as the test expects. The query params indicate that both streams are desired. I'm not sure how the test ensures that the STDOUT stream is the one it is reading.
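The "both streams desired" point can be seen from the query params in the kubelet log line above. As a rough sketch (the parameter names `output=1` and `error=1` and the node address are modeled on that log line, not taken from any official client library), the exec request might be built like this:

```go
package main

import (
	"fmt"
	"net/url"
)

// execURL is an illustrative helper: it builds an exec request URL
// that asks for both output streams, mirroring the query params in
// the kubelet log line from this issue. Hypothetical, not real
// client code.
func execURL(namespace, pod, container string, command []string) string {
	q := url.Values{}
	for _, c := range command {
		q.Add("command", c)
	}
	q.Set("output", "1") // request the stdout stream (channel 1)
	q.Set("error", "1")  // request the stderr stream (channel 2)
	u := url.URL{
		Scheme:   "wss",
		Host:     "node.example:10250", // hypothetical node address
		Path:     fmt.Sprintf("/exec/%s/%s/%s", namespace, pod, container),
		RawQuery: q.Encode(),
	}
	return u.String()
}

func main() {
	fmt.Println(execURL("e2e-tests-pods-wn695", "pod-exec-websocket", "main",
		[]string{"cat", "/etc/resolv.conf"}))
}
```

Because both streams are requested, the server is free to write frames on either channel, which is what makes the stray stderr frame possible.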

@sjenning
Contributor

I do wonder whether removing this line would resolve the situation, since the test does not expect to read from STDERR:
https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/test/e2e/common/pods.go#L499

@liggitt
Contributor

liggitt commented Feb 23, 2018

The 302 is a redirect followed by the API server before returning anything to the client, and has to do with redirecting to what will eventually be a CRI endpoint on the node. For now, it is still served by the kubelet.

@liggitt
Contributor

liggitt commented Feb 23, 2018

We are requesting both stdout and stderr output from running the command be sent to us, but we are not expecting to actually receive any stderr output from that command, which is why the test fails if it receives a message on the stderr channel. We should not change the test to ignore error output. I do wonder if docker changed to send zero byte writes to the standard error output stream. I could potentially see updating the test to ignore empty stderr channel messages.

@smarterclayton
Contributor

Sorry I didn't respond. I think ignoring empty stderr channel messages is totally fine; we use them in other contexts, and if we ever have websockets support we will deliver zero byte messages.

@runcom
Member

runcom commented Feb 26, 2018

We are requesting both stdout and stderr output from running the command be sent to us, but we are not expecting to actually receive any stderr output from that command, which is why the test fails if it receives a message on the stderr channel. We should not change the test to ignore error output. I do wonder if docker changed to send zero byte writes to the standard error output stream. I could potentially see updating the test to ignore empty stderr channel messages.

is there any way to read what's in STDERR actually? @liggitt @sjenning

@liggitt
Contributor

liggitt commented Feb 26, 2018

is there any way to read what's in STDERR actually? @liggitt @sjenning

the content of the message is logged, and is empty:

Feb 22 20:21:07.165: Got message from server that didn't start with channel 1 (STDOUT): [2]

that means the message contained exactly one byte (2), which indicated the stderr channel, and no other content
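For readers unfamiliar with the framing: each binary websocket message in the exec protocol starts with a one-byte channel id (1 = stdout, 2 = stderr, as the test output shows) followed by the payload. A minimal sketch of splitting such a frame (the function name is illustrative, not the real client code):

```go
package main

import "fmt"

// splitFrame separates a channel-prefixed exec message into its
// channel id and payload. An empty message yields channel 0 and no
// payload. Illustrative sketch only.
func splitFrame(msg []byte) (channel byte, payload []byte) {
	if len(msg) == 0 {
		return 0, nil
	}
	return msg[0], msg[1:]
}

func main() {
	// The failing message in this issue was exactly one byte: [2],
	// i.e. the stderr channel id with an empty payload.
	ch, payload := splitFrame([]byte{2})
	fmt.Printf("channel=%d payloadLen=%d\n", ch, len(payload))
}
```

Under this framing, a one-byte message `[2]` is a stderr frame whose payload length is zero, which is exactly what the test error message reports.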

@sjenning
Contributor

Opened PR kubernetes/kubernetes#60457 upstream

k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this issue Mar 20, 2018
Automatic merge from submit-queue (batch tested with PRs 60457, 60331, 54970, 58731, 60562). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

tests: e2e: empty msg from channel other than stdout should be non-fatal

Currently, if the exec websocket test encounters a message that is not on the stdout stream, it immediately fails. However, it also currently requests the stderr stream in the query params. There doesn't seem to be any guarantee that we don't get an empty message on the stderr stream.

Requesting the stderr stream in the query is desirable if, for some reason, something in the container fails and writes to stderr.

However, we do not need to fail the test if we get an empty message on another stream. If the message is not empty, then that _does_ indicate an error and we should fail.

This is the situation we are currently observing with docker 1.13 in the origin CI openshift/origin#18726

@derekwaynecarr @smarterclayton @gabemontero @liggitt @deads2k 

/sig node
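The behavior the PR describes can be sketched as follows (the function name and shape are illustrative, not the actual e2e test code from kubernetes/kubernetes#60457): while draining channel-prefixed messages, an empty payload on a non-stdout channel is tolerated, while a non-empty payload on such a channel still fails.

```go
package main

import "fmt"

// collectStdout sketches the fixed test logic: accumulate stdout
// (channel 1) payloads, skip empty frames on other channels, and
// fail only on non-empty frames from other channels. Illustrative
// sketch, not the real e2e code.
func collectStdout(msgs [][]byte) (string, error) {
	var out []byte
	for _, msg := range msgs {
		if len(msg) == 0 {
			continue // no channel byte at all; nothing to do
		}
		channel, payload := msg[0], msg[1:]
		if channel != 1 {
			if len(payload) == 0 {
				continue // e.g. the bare [2] stderr frame seen with docker 1.13
			}
			return "", fmt.Errorf("unexpected message on channel %d: %q", channel, payload)
		}
		out = append(out, payload...)
	}
	return string(out), nil
}

func main() {
	out, err := collectStdout([][]byte{{2}, {1, 'o', 'k'}})
	fmt.Println(out, err)
}
```

With this change, the empty `[2]` frame that broke the conformance runs is ignored, while genuine stderr output from the container still fails the test as before.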