/tmp/deprovision.sh failed #13559

soltysh · 2017-03-28T14:48:51Z

Seen in https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended_conformance_gce/635/console. The direct error seems to be related to this particular GCE failure:

ERROR: (gcloud.compute.instance-templates.delete) Some requests did not succeed:
 - Code: '385079975793035184'", "stdout": "", "stdout_lines": [], "warnings": []}

stevekuznetsov · 2017-03-29T01:41:19Z

@smarterclayton when was the jUnit + ansible output stuff supposed to bubble up to the image being used? We don't have this logged.

smarterclayton · 2017-03-29T08:48:41Z

Uh...

soltysh · 2017-06-05T13:24:20Z

Another one in https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended_conformance_gce/2729/

soltysh · 2017-06-05T13:28:00Z

Deleted [https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci/global/instanceTemplates/ci-prtest-5a37c28-2729-instance-template-master].
ERROR: (gcloud.compute.firewall-rules.delete) Some requests did not succeed:
 - Internal Error

Looks like an internal GCE error; I'm not sure there is anything we can do about it.

smarterclayton · 2017-06-06T04:22:39Z

I opened an issue and they couldn't figure out how to help me.

soltysh · 2017-06-06T09:10:25Z

Not cool :/

soltysh · 2017-06-20T14:29:12Z

@stevekuznetsov @smarterclayton would it hurt if we ignored errors at the de-provision stage? I mean removing set -o errexit from this line: https://github.com/openshift/aos-cd-jobs/blob/da4567fc5a8dc4a804b9b02803a4d49dd8657816/sjb/generated/test_pull_request_origin_extended_conformance_gce.xml#L347

stevekuznetsov · 2017-06-20T14:46:22Z

Today all shell tasks are generated from the template at aos-cd-jobs/sjb/actions/named_shell_task.py. I think making the job "fail-able" would be a good idea -- we would need to thread through a new config option on shell steps that could default to making it fail on error, but allow us to turn it off otherwise. Do you think you can make a PR to tackle this?

soltysh · 2017-06-20T18:54:57Z

@stevekuznetsov on second thought, I started wondering: if this is supposed to fix the flake, we need to enable it by default and only disable it on demand. With this option turned off we don't get rid of the flake. Why do you think this should be off by default?

stevekuznetsov · 2017-06-20T19:36:58Z

The named_shell_task handles all shell tasks with a name in those jobs -- so that is provisioning, building, testing, etc -- it's only the deprovisioning and cleanup tasks where we want to be able to optionally disable the -o errexit flag to allow those steps to fail silently while doing best-effort work.
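
The opt-out described above could look roughly like the following. This is only a sketch under assumptions: run_step and its fail_on_error argument are hypothetical names, not the actual named_shell_task.py template or its config option. Each stage body runs in a child shell so that its errexit setting cannot leak into the caller.

```shell
# Hypothetical sketch: run a named stage with errexit on by default,
# with an opt-out for best-effort stages such as deprovision/cleanup.
# run_step and fail_on_error are illustrative names, not from aos-cd-jobs.
run_step() {
  local name="$1"
  local fail_on_error="$2"
  local body="$3"

  echo "=== stage: ${name} ==="
  if [ "${fail_on_error}" = "true" ]; then
    # Child shell with errexit: the first failing command aborts the stage.
    sh -c "set -o errexit
${body}"
  else
    # Child shell without errexit: failing commands are skipped over and
    # the stage's status is that of its last command.
    sh -c "${body}"
  fi
}

# A build-like stage stops at the first failure.
run_step "build" true 'false; echo "not reached"' \
  || echo "build stage failed with status $?"

# A cleanup-like stage keeps going after a failed command.
run_step "deprovision" false 'false; echo "still cleaning up"'
echo "deprovision stage exited with status $?"
```

Note that without errexit the stage still reports the status of its last command, so a best-effort stage can still fail if that final command fails.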

smarterclayton · 2017-06-20T23:05:54Z

The flake is that we leave things uncleaned in GCE. Why don't you improve deprovision to retry certain operations? Or just rerun if a failure happens?

soltysh · 2017-06-21T07:30:07Z

> Or just rerun if a failure happens?

I think I'll go with that, although I'm not 100% sure which step failed, I don't have logs anymore :( I'll dig more.

stevekuznetsov · 2017-06-21T13:40:17Z

I'm not sure retry is always appropriate -- for instance, in many deprovision/cleanup stages we fail to grab something from the remote host for whatever reason, we can live without that artifact, but we'd rather successfully continue on to actually deprovisioning the machine. Doing a re-try loop might be simpler (and require only job stage edits) but the -o errexit optional stuff could be very useful past this specific issue.

soltysh · 2017-06-21T15:30:06Z

I was originally thinking about retry logic for that particular use case. Adding the option you mentioned is not a problem, either.

smarterclayton · 2017-06-21T16:18:08Z

The retry should be inside of deprovision and origin-gce, not in the job. The playbook deprovision already has to be idempotent, and if some cloud providers have bugs, that's just the way of the world.
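
A retry kept inside the deprovision script could be sketched as below. The retry helper, its argument order, and the example rule name are assumptions made for illustration; the thread does not show the real deprovision.sh. Retrying is only safe here because the operations are expected to be idempotent, as noted above.

```shell
# Hypothetical helper: retry <attempts> <delay-seconds> <command...>
# Re-runs the command until it succeeds or the attempts are used up.
retry() {
  local attempts="$1"
  local delay="$2"
  shift 2
  local n=1

  while :; do
    if "$@"; then
      return 0
    fi
    if [ "${n}" -ge "${attempts}" ]; then
      echo "giving up after ${n} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${n} failed, retrying in ${delay}s: $*" >&2
    n=$((n + 1))
    sleep "${delay}"
  done
}

# Stand-in for a flaky cloud call: fails twice, then succeeds.
flaky_calls=0
flaky_delete() {
  flaky_calls=$((flaky_calls + 1))
  [ "${flaky_calls}" -ge 3 ]
}

retry 5 0 flaky_delete && echo "succeeded after ${flaky_calls} calls"

# In a deprovision script this might wrap a real call, for example
# (the rule name is a placeholder):
#   retry 3 10 gcloud compute firewall-rules delete "${rule_name}" --quiet
```

A fixed delay keeps the sketch short; a backoff between attempts may be a better fit for transient provider errors.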

smarterclayton · 2017-06-21T16:18:22Z

None of this should be things the job is aware of.

stevekuznetsov · 2017-06-21T16:23:28Z

OK, sounds reasonable. The jobs would be better off if we could turn off early exit on error, as that makes our deprovision safer: today, if a pre-deprovision step like grabbing logs fails, we don't run deprovision at all, because we cannot express "always run this step even on failure" in Jenkins. The deprovision.sh script would be better off retrying.

stevekuznetsov · 2017-06-26T15:36:02Z

@mfojtik !?

openshift-bot · 2018-02-13T04:38:39Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot · 2018-03-16T11:12:14Z

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

stevekuznetsov · 2018-03-16T14:54:36Z

This has actually been fixed by doing a trap to make those stages never fail.

/close
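
The thread does not show the trap itself, so the following is only a guess at its general shape: an EXIT trap that forces a zero exit status, so a failing command ends the stage early without failing the job. best_effort_stage is a hypothetical name.

```shell
# Hypothetical sketch of a stage that can never fail the job.
# The body runs in a child shell so the trap does not affect the caller.
best_effort_stage() {
  # $1 is the stage body; the trap is installed before the body runs.
  sh -c "trap 'exit 0' EXIT
set -o errexit
$1"
}

# errexit still stops the stage at the first failing command, but the
# EXIT trap replaces the exit status with 0.
best_effort_stage 'echo "grabbing logs"; false; echo "not reached"'
echo "stage exit status: $?"
```

One consequence of this shape is that real cleanup failures are hidden from the job result, so leaked resources would have to be caught some other way.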

soltysh added the kind/test-flake label Mar 28, 2017

soltysh mentioned this issue Mar 28, 2017

Update instructions to explicitly tag an image from docker repository #13552

Merged

soltysh added the area/infrastructure label Mar 28, 2017

pweil- added the priority/P1 label Mar 30, 2017

pweil- assigned soltysh Mar 30, 2017

soltysh mentioned this issue Jun 5, 2017

Bug 1450291 - Improve logs in image pruning #14405

Merged

0xmichalis mentioned this issue Jun 6, 2017

Move docker specific scripts back into image directories #14482

Closed

soltysh mentioned this issue Jun 7, 2017

Step "De-provision GCE resources" fails with "Internal error" #14497

Closed

mfojtik added priority/P2 and removed priority/P1 labels Jun 26, 2017

mfojtik assigned stevekuznetsov and unassigned soltysh Jun 26, 2017

stevekuznetsov assigned soltysh Jun 26, 2017

openshift-ci-robot added the lifecycle/stale label Feb 13, 2018

openshift-ci-robot added lifecycle/rotten and removed lifecycle/stale labels Mar 16, 2018

openshift-ci-robot closed this as completed Mar 16, 2018
