Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature:Builds][timing] capture build stages and durations should record build stages and durations for s2i [Suite:openshift/conformance/parallel] 35s #17897

Closed
deads2k opened this issue Dec 20, 2017 · 20 comments
Assignees
Labels
component/imageregistry kind/test-flake Categorizes issue or PR as related to test flakes. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. priority/P1 sig/developer-experience

Comments

@deads2k
Copy link
Contributor

deads2k commented Dec 20, 2017

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/17874/test_pull_request_origin_extended_conformance_gce/13345/

[Feature:Builds][timing] capture build stages and durations should record build stages and durations for s2i [Suite:openshift/conformance/parallel] 35s

/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/builds/build_timing.go:54
Expected
    <bool>: false
to be true
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/builds/build_timing.go:73
@deads2k deads2k added kind/test-flake Categorizes issue or PR as related to test flakes. sig/developer-experience labels Dec 20, 2017
@bparees
Copy link
Contributor

bparees commented Dec 21, 2017

the test created the imagestream:

Dec 20 13:58:29.934: INFO: Running 'oc create --config=/tmp/extended-test-build-timing-ll26s-r59kk-user.kubeconfig --namespace=extended-test-build-timing-ll26s-r59kk -f /tmp/fixture-testdata-dir215236524/test/extended/testdata/builds/build-timing/test-is.json'
imagestream "test" created

the registry claims it doesn't exist:

172.16.6.1 - - [20/Dec/2017:13:58:44 +0000] "PUT /v1/repositories/extended-test-build-timing-ll26s-r59kk/test/ HTTP/1.1" 404 19 "" "docker/1.12.6 go/go1.8.3 kernel/3.10.0-693.11.1.el7.x86_64 os/linux arch/amd64 UpstreamClient(go-dockerclient)"

But even then i would have expected the registry to create the imagestream, not 404 the put?

@legionus @dmage @miminar

@dmage
Copy link
Contributor

dmage commented Dec 21, 2017

We do not implement v1 API. I don't know why the client has decided to use this version of API.

@bparees
Copy link
Contributor

bparees commented Dec 21, 2017

were there any other PUT /v1 (or other /v1 api invocations in general) in the registry log? The build just does a docker push via the node's docker daemon, so this would imply that for some reason the docker daemon decided to use the v1 api? Perhaps an issue w/ the docker daemon version in our AMI or with the docker daemon configuration?

@bparees
Copy link
Contributor

bparees commented Dec 21, 2017

(since we didn't see all the pushes fail, it is particularly strange..I assume there were some PUT /v2 entries in the registry log)

@dmage
Copy link
Contributor

dmage commented Jan 10, 2018

time="2017-12-20T13:58:37.91693671Z" level=info msg="listening on :5000, tls" go.version=go1.8.3 instance.id=eb4c4b0b-c946-43ef-a191-3fd3eea9a715
172.16.4.1 - - [20/Dec/2017:13:58:44 +0000] "GET /healthz HTTP/2.0" 200 0 "" "kube-probe/."
... REDACTED ...
172.16.6.1 - - [20/Dec/2017:13:58:44 +0000] "GET /v1/_ping HTTP/1.1" 404 19 "" "docker/1.12.6 go/go1.8.3 kernel/3.10.0-693.11.1.el7.x86_64 os/linux arch/amd64 UpstreamClient(go-dockerclient)"

We don't get any requests until the first /healthz request.

Dec 20 13:58:54.540: INFO: router-1-deploy Pending [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2017-12-20 13:55:05 +0000 UTC Unschedulable 0/4 nodes are available: 4 MatchNodeSelector.}]

That's looks like the reason of delay.

@bparees do we want to wait http(s?)://docker-registry.default.svc:5000/healthz before pushing?

@bparees
Copy link
Contributor

bparees commented Jan 10, 2018

@bparees do we want to wait http(s?)://docker-registry.default.svc:5000/healthz before pushing?

I think we want to add a check for /healthz to the extended test framework setup so we don't start running any tests until its passing.

@dmage
Copy link
Contributor

dmage commented Jan 10, 2018

Looks like the registry is started by the ansible playbook and we skip waiting for the registry:

TASK [openshift_hosted : Wait for pod (Registry)] ******************************
Wednesday 20 December 2017  13:55:25 +0000 (0:00:00.055)       0:15:49.500 **** 
included: /usr/share/ansible/openshift-ansible/roles/openshift_hosted/tasks/wait_for_pod.yml for ci-prtest-5a37c28-13345-ig-m-lcr7

TASK [openshift_hosted : Ensure OpenShift pod correctly rolls out (best-effort today)] ***
Wednesday 20 December 2017  13:55:25 +0000 (0:00:00.062)       0:15:49.563 **** 
skipping: [ci-prtest-5a37c28-13345-ig-m-lcr7] => (item={u'namespace': u'default', u'name': u'docker-registry'}) 

TASK [openshift_hosted : Determine the latest version of the OpenShift pod deployment] ***
Wednesday 20 December 2017  13:55:25 +0000 (0:00:00.045)       0:15:49.609 **** 
skipping: [ci-prtest-5a37c28-13345-ig-m-lcr7] => (item={u'namespace': u'default', u'name': u'docker-registry'}) 

TASK [openshift_hosted : Poll for OpenShift pod deployment success] ************
Wednesday 20 December 2017  13:55:25 +0000 (0:00:00.044)       0:15:49.653 **** 
skipping: [ci-prtest-5a37c28-13345-ig-m-lcr7] => (item=[{u'namespace': u'default', u'name': u'docker-registry'}, {'skipped': True, '_ansible_no_log': False, 'skip_reason': u'Conditional result was False', '_ansible_item_result': True, 'item': {u'namespace': u'default', u'name': u'docker-registry'}, 'changed': False}]) 

@bparees bparees assigned sdodson and stevekuznetsov and unassigned dmage Jan 10, 2018
@bparees
Copy link
Contributor

bparees commented Jan 10, 2018

@stevekuznetsov @sdodson please decide between yourselves whether the playbook should address this, or elsewhere in the test framework startup process. One of them needs to wait for the registry to pass a healthcheck.

@stevekuznetsov
Copy link
Contributor

Pretty sure the playbooks had that logic -- I remember adding it

@sdodson
Copy link
Member

sdodson commented Jan 10, 2018

It's there, there was a recent change that may have broken it.
/assign michaelgugino

@sdodson
Copy link
Member

sdodson commented Jan 10, 2018

@dmage do you have a link to that job? nevermind, that's from the same job as the original post.

@dmage
Copy link
Contributor

dmage commented Jan 10, 2018

@michaelgugino
Copy link
Contributor

I don't see openshift-ansible being cloned in that job anywhere. It appears that it's baked into some image, and that image is out of date.

@sdodson
Copy link
Member

sdodson commented Jan 10, 2018

We've verified that current openshift-ansible install ci jobs are waiting for the registry and router to have rolled out. Marking this as closed until there's a more recent example.

@sdodson sdodson closed this as completed Jan 10, 2018
@bparees
Copy link
Contributor

bparees commented Jan 10, 2018

awesome, thank you @sdodson

@aveshagarwal
Copy link
Contributor

@bparees
Copy link
Contributor

bparees commented Jan 20, 2018

push started at 22:00:20:

2018-01-19T22:00:20.269142024Z Pushing image docker-registry.default.svc:5000/extended-test-build-timing-gv8d9-rrg2j/test:latest ...

registry logs indicate it was just starting up at 22:00:11:

Jan 19 22:00:28.927: INFO: Running 'oc logs --config=/tmp/cluster-admin.kubeconfig --namespace=default dc/docker-registry --since=38.927649206s'
Jan 19 22:00:29.383: INFO: time="2018-01-19T22:00:11Z" level=warning msg="Ignoring unrecognized environment variable REGISTRY_CONSOLE_PORT" 
time="2018-01-19T22:00:11Z" level=warning msg="Ignoring unrecognized environment variable REGISTRY_CONSOLE_PORT_9000_TCP" 
time="2018-01-19T22:00:11Z" level=warning msg="Ignoring unrecognized environment variable REGISTRY_CONSOLE_PORT_9000_TCP_ADDR" 
time="2018-01-19T22:00:11Z" level=warning msg="Ignoring unrecognized environment variable REGISTRY_CONSOLE_PORT_9000_TCP_PORT" 
time="2018-01-19T22:00:11Z" level=warning msg="Ignoring unrecognized environment variable REGISTRY_CONSOLE_PORT_9000_TCP_PROTO" 
time="2018-01-19T22:00:11Z" level=warning msg="Ignoring unrecognized environment variable REGISTRY_CONSOLE_SERVICE_HOST" 
time="2018-01-19T22:00:11Z" level=warning msg="Ignoring unrecognized environment variable REGISTRY_CONSOLE_SERVICE_PORT" 
time="2018-01-19T22:00:11Z" level=warning msg="Ignoring unrecognized environment variable REGISTRY_CONSOLE_SERVICE_PORT_REGISTRY_CONSOLE" 
time="2018-01-19T22:00:11Z" level=info msg="DEPRECATED: \"OPENSHIFT_DEFAULT_REGISTRY\" is deprecated, use the 'REGISTRY_OPENSHIFT_SERVER_ADDR' instead" 
time="2018-01-19T22:00:11.560767956Z" level=info msg="start registry" distribution_version=v2.6.2+unknown go.version=go1.9.2 instance.id=441efb02-5639-4779-bcc8-1f9f16840753 kubernetes_version="v1.7.6+$Format:%h$" openshift_version=v3.9.0-alpha.3+9db0f3b 

so unless the registry was restarted, it does seem like something probably didn't wait for it to be up before our tests started running.

@openshift-bot
Copy link
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 20, 2018
@stevekuznetsov
Copy link
Contributor

/close

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/imageregistry kind/test-flake Categorizes issue or PR as related to test flakes. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. priority/P1 sig/developer-experience
Projects
None yet
Development

No branches or pull requests

10 participants