/tmp/deprovision.sh failed #13559

Closed
soltysh opened this issue Mar 28, 2017 · 21 comments
Labels: area/infrastructure, kind/test-flake, lifecycle/rotten, priority/P2

Comments

@soltysh
Contributor

soltysh commented Mar 28, 2017

Seen in https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended_conformance_gce/635/console. The direct error seems to be related to this particular GCE failure:

ERROR: (gcloud.compute.instance-templates.delete) Some requests did not succeed:
 - Code: '385079975793035184'", "stdout": "", "stdout_lines": [], "warnings": []}
@stevekuznetsov
Contributor

@smarterclayton when was the jUnit + ansible output stuff supposed to bubble up to the image being used? We don't have this logged.

@smarterclayton
Contributor

Uh...

@soltysh
Contributor Author

soltysh commented Jun 5, 2017

@soltysh
Contributor Author

soltysh commented Jun 5, 2017

Deleted [https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci/global/instanceTemplates/ci-prtest-5a37c28-2729-instance-template-master].
ERROR: (gcloud.compute.firewall-rules.delete) Some requests did not succeed:
 - Internal Error

Looks like an internal GCE error; I'm not sure there's anything we can do about it.

@smarterclayton
Contributor

smarterclayton commented Jun 6, 2017 via email

@soltysh
Contributor Author

soltysh commented Jun 6, 2017

Not cool :/

@soltysh
Contributor Author

soltysh commented Jun 20, 2017

@stevekuznetsov
Contributor

Today all shell tasks are generated from the template at aos-cd-jobs/sjb/actions/named_shell_task.py. I think making the job "fail-able" would be a good idea -- we would need to thread through a new config option on shell steps that could default to making it fail on error, but allow us to turn it off otherwise. Do you think you can make a PR to tackle this?

@soltysh
Contributor Author

soltysh commented Jun 20, 2017

@stevekuznetsov on second thought, I started wondering. If this is supposed to fix the flake, we need to enable it by default and only disable it on demand. With this option turned off we don't get rid of the flake. Why do you think this should be off by default?

@stevekuznetsov
Contributor

The named_shell_task handles all shell tasks with a name in those jobs -- so that is provisioning, building, testing, etc -- it's only the deprovisioning and cleanup tasks where we want to be able to optionally disable the -o errexit flag to allow those steps to fail silently while doing best-effort work.
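
For illustration only, here is a minimal sketch of what an optional errexit switch could look like at the shell level. `run_step` and its `strict` / `best-effort` modes are hypothetical names invented for this example; the real template in named_shell_task.py is not shown in this thread and may be structured quite differently.

```shell
#!/bin/bash
# Hypothetical helper: run_step and its mode names are assumptions made for
# this sketch, not part of the actual sjb templates.
run_step() {
    mode="$1"     # "strict" (the default behaviour today) or "best-effort"
    script="$2"   # the step body, as a single string of shell commands

    if [ "$mode" = "best-effort" ]; then
        # No errexit: commands after a failing one still run, and the
        # step's exit status is reported but not propagated to the job.
        bash -c "$script" || echo "best-effort step failed with status $?, continuing" >&2
        return 0
    fi

    # Strict mode: abort the step at the first failing command.
    bash -o errexit -c "$script"
}

# A failed log grab (simulated with `false`) does not stop what follows it.
run_step best-effort 'false; echo "still deprovisioning"'
```

With something along these lines, provisioning, build, and test steps would keep the strict behaviour, and only deprovision or cleanup steps would opt in to `best-effort`.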

@smarterclayton
Contributor

smarterclayton commented Jun 20, 2017 via email

@soltysh
Contributor Author

soltysh commented Jun 21, 2017

Or just rerun if a failure happens?

I think I'll go with that, although I'm not 100% sure which step failed; I don't have the logs anymore :( I'll dig more.

@stevekuznetsov
Contributor

I'm not sure retry is always appropriate -- for instance, in many deprovision/cleanup stages we fail to grab something from the remote host for whatever reason, we can live without that artifact, but we'd rather successfully continue on to actually deprovisioning the machine. Doing a re-try loop might be simpler (and require only job stage edits) but the -o errexit optional stuff could be very useful past this specific issue.
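
As a rough sketch of the retry-loop alternative, something like the following could wrap an individual flaky command inside a stage. The `retry` function, its argument order, and the `RULE_NAME` variable in the usage comment are all hypothetical; nothing here is taken from the actual deprovision.sh.

```shell
#!/bin/bash
# Hypothetical helper: retry <max-attempts> <delay-seconds> <command...>
# Re-runs the command until it succeeds or the attempts are used up.
retry() {
    attempts="$1"
    delay="$2"
    shift 2

    n=1
    until "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            echo "giving up after $n attempts: $*" >&2
            return 1
        fi
        echo "attempt $n failed, retrying in ${delay}s: $*" >&2
        n=$((n + 1))
        sleep "$delay"
    done
}

# Possible use against a transient GCE "Internal Error" (RULE_NAME is a
# placeholder, not a value from this issue):
#   retry 3 10 gcloud compute firewall-rules delete "$RULE_NAME" --quiet
```

A retry only helps when the failure is transient. It does not cover the case above, where a missing artifact is acceptable and the stage should simply move on.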

@soltysh
Contributor Author

soltysh commented Jun 21, 2017

I was originally thinking about retry logic for that particular use case. Adding the option you've mentioned is not a problem, either.

@smarterclayton
Contributor

smarterclayton commented Jun 21, 2017 via email

@smarterclayton
Contributor

smarterclayton commented Jun 21, 2017 via email

@stevekuznetsov
Contributor

OK, sounds reasonable. The jobs would be better off if we could turn off early exit on error, as that makes our deprovision safer -- today, if a pre-deprovision step like grabbing logs fails, we don't run deprovision at all, because we have no way to express "always run this step even on failure" in Jenkins. The deprovision.sh script itself would be better off retrying.

@stevekuznetsov
Contributor

@mfojtik !?

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label Feb 13, 2018
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot added lifecycle/rotten and removed lifecycle/stale labels Mar 16, 2018
@stevekuznetsov
Contributor

This has actually been fixed by doing a trap to make those stages never fail.

/close
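
The exact trap used in the fix is not shown in this thread. One plausible shape, sketched here purely as an assumption, is an EXIT trap that forces a zero exit status, so the stage still stops at the first error but never marks the job as failed. The `stage` variable and its contents are made up for the example.

```shell
#!/bin/bash
# Stand-in for one generated cleanup stage; the body is invented for this
# sketch and is not the real stage script.
stage='
set -o errexit
trap "exit 0" EXIT          # assumed fix: override the exit status on the way out
echo "fetching logs"
false                       # simulated failure, e.g. a failed copy from the remote host
echo "not reached"
'

bash -c "$stage"
echo "stage exit status: $?"
```

Here errexit still ends the stage at the failing command, but the trap replaces the non-zero status, so Jenkins would see the stage as successful and carry on to the stages after it.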

7 participants