[Feature:Prometheus][Feature:Builds] Prometheus when installed to the cluster should start and expose a secured proxy and verify build metrics [Suite:openshift/conformance/paralle] #17694

Closed
enj opened this issue Dec 8, 2017 · 11 comments
Labels: component/build, kind/test-flake, priority/P1

@enj
Contributor

enj commented Dec 8, 2017

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/17617/test_pull_request_origin_extended_conformance_gce/12491/

/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:108
Expected
    <map[string]error | len:1>: {
        "openshift_build_active_time_seconds": {
            s: "query openshift_build_active_time_seconds for tests []prometheus.metricTest{prometheus.metricTest{labels:map[string]string{\"name\":\"frontend-1\"}, greaterThan:true, value:0, success:false}} had results {\"status\":\"success\",\"data\":{\"resultType\":\"vector\",\"result\":[{\"metric\":{\"__name__\":\"openshift_build_active_time_seconds\",\"instance\":\"10.142.0.2:8444\",\"job\":\"kubernetes-controllers\",\"name\":\"mydockertest-1\",\"namespace\":\"extended-test-build-valuefrom-w6mm6-dfgl8\",\"phase\":\"Running\"},\"value\":[1512582511.727,\"1512582414\"]}]}}",
        },
    }
to be empty
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:171
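
For readers following the failure message above, here is a minimal Go sketch of the kind of check the error describes: look for a sample whose labels include name=frontend-1 and whose value is greater than 0. The names promResponse, promSample, labelsMatch and hasSampleGreaterThan are assumptions made up for illustration and are not taken from prometheus_builds.go; only the JSON shape comes from the response quoted in the error.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// promResponse mirrors only the parts of the query response that
// appear in the failure message. Hypothetical type, not from the test.
type promResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string       `json:"resultType"`
		Result     []promSample `json:"result"`
	} `json:"data"`
}

// promSample is one entry of the result vector.
// Value holds [ <unix timestamp>, "<sample value as a string>" ].
type promSample struct {
	Metric map[string]string `json:"metric"`
	Value  [2]interface{}    `json:"value"`
}

// labelsMatch reports whether every wanted label is present in actual
// with the same value. Extra labels in actual are ignored.
func labelsMatch(wanted, actual map[string]string) bool {
	for k, v := range wanted {
		got, ok := actual[k]
		if !ok || got != v {
			return false
		}
	}
	return true
}

// hasSampleGreaterThan reports whether any sample matching the wanted
// labels has a value above threshold. Hypothetical helper.
func hasSampleGreaterThan(body []byte, wanted map[string]string, threshold float64) (bool, error) {
	var resp promResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return false, err
	}
	if resp.Status != "success" {
		return false, fmt.Errorf("query status %q", resp.Status)
	}
	for _, s := range resp.Data.Result {
		if !labelsMatch(wanted, s.Metric) {
			continue // e.g. a build from another test in the same cluster
		}
		raw, ok := s.Value[1].(string)
		if !ok {
			continue
		}
		v, err := strconv.ParseFloat(raw, 64)
		if err != nil {
			return false, err
		}
		if v > threshold {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Response body copied from the failure message, with escaping removed.
	body := []byte(`{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"openshift_build_active_time_seconds","instance":"10.142.0.2:8444","job":"kubernetes-controllers","name":"mydockertest-1","namespace":"extended-test-build-valuefrom-w6mm6-dfgl8","phase":"Running"},"value":[1512582511.727,"1512582414"]}]}}`)

	found, err := hasSampleGreaterThan(body, map[string]string{"name": "frontend-1"}, 0)
	fmt.Println("frontend-1 matched:", found, "err:", err)
}
```

Under this reading, the only sample in the quoted response belongs to mydockertest-1, so a check for name=frontend-1 finds nothing, which is consistent with success:false in the error.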
enj added the kind/test-flake label Dec 8, 2017
@bparees
Contributor

bparees commented Dec 8, 2017

@gabemontero is this another race where something else produced metrics in the cluster that we didn't expect to see in our test?

@gabemontero
Contributor

no it is something else @bparees

the test logic should have ignored the mydockertest-1 build that is in the result data in the error message in the description

and I see the frontend-1 build running and completing during the test's polling window

and I see a prometheus-0 pod up and running (though I thought prometheus was converted to a stateful set ... though maybe that entails a pod under the covers)

I think we need more debug in our ext test on the failure
a) perhaps dump the intermediate scan results
b) and similar to our jenkins testing, dump the prometheus pod, and see if there was any issue with it scraping data from the build controller

@gabemontero
Contributor

confirmed the stateful set still maps to a pod (instantiated the example prometheus template manually)

@bparees
Contributor

bparees commented Dec 12, 2017

> (though I thought prometheus was converted to a stateful set ... though maybe that entails a pod under the covers)

it does. pods are the fundamental unit of any workload. (well, containers are, but you can't have a container w/o a pod)

openshift-merge-robot added a commit that referenced this issue Dec 14, 2017
Automatic merge from submit-queue (batch tested with PRs 17734, 17550, 17647, 17761, 17564).

add debug for build prometheus extended test failures

debug for #17694

@openshift/sig-developer-experience fyi / ptal
@gabemontero
Contributor

PR with fix merged ... not sure why bot did not close this.

@bparees
Contributor

bparees commented Dec 21, 2017

@gabemontero
#17717 (comment)

you had the wrong issue referenced.

@bparees
Contributor

bparees commented Dec 21, 2017

(well maybe not wrong but it looks like we had two open?)

@gabemontero
Contributor

the failure in https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/17783/test_pull_request_origin_extended_conformance_install/5804/ was the same test as this issue is tracking, but the failure was different than the symptoms originally reported in this issue, and not related to the prometheus build stats being verified.

The sample app build we launched started and failed quickly (deps download glitch I believe), and the active stat query failed.

Separate from the original, precise symptoms which led me to leave this issue open,
it does beg a question I've asked myself before ... automated testing for the active stat has been tricky, even when developing it, given the batching intervals for prometheus querying the build controller. How many flakes need to occur before we cry uncle on the active stat and remove it?

Pondering ....

@gabemontero
Contributor

If I do delete the active build query, I'll do it in a separate item. Reclosing this issue per original symptom.

@gabemontero
Contributor

separate item - #18193


4 participants