Statefulset creates and deletes pod repeatedly, race condition or other error? #17435

smarterclayton · 2017-11-22T23:12:16Z

In master:

I1122 23:10:02.330380   98410 event.go:218] Event(v1.ObjectReference{Kind:"StatefulSet", Namespace:"kube-system", Name:"prometheus", UID:"a7de8d06-cfce-11e7-9fbc-080027893417", APIVersion:"apps", ResourceVersion:"64043", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' create Pod prometheus-0 in StatefulSet prometheus successful
I1122 23:10:02.337665   98410 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"prometheus-0", UID:"44dd76cc-cfda-11e7-9a7c-080027893417", APIVersion:"v1", ResourceVersion:"64057", FieldPath:""}): type: 'Normal' reason: 'Scheduled' Successfully assigned prometheus-0 to localhost.localdomain
I1122 23:10:02.357457   98410 event.go:218] Event(v1.ObjectReference{Kind:"StatefulSet", Namespace:"kube-system", Name:"prometheus", UID:"a7de8d06-cfce-11e7-9fbc-080027893417", APIVersion:"apps", ResourceVersion:"64062", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' delete Pod prometheus-0 in StatefulSet prometheus successful
E1122 23:10:02.364507   98410 stateful_set.go:395] Error syncing StatefulSet kube-system/prometheus, requeuing: StatefulSet.apps "prometheus" is invalid: status.currentReplicas: Invalid value: -1: must be greater than or equal to 0

Not sure if this is a new post-rebase bug, a known issue already fixed upstream, or something else that is broken. This could be very serious in a 3.7->3.8->3.9 fast rolling update.

@openshift/sig-master


smarterclayton · 2017-11-22T23:15:08Z

I was able to create it and it worked fine for a while, but after updating the stateful set it began doing this. It might be an error with controller history? No other obvious errors in logs at v=2.

Scenario:

create prometheus example on master
wait until it's up, view it
edit the stateful set

Actual: infinite loop of create and delete.

Could be related to image stream tag resolution? Needs master team to look at (nothing explicit about image stream resolution in place).

smarterclayton · 2017-11-22T23:15:34Z

$ oc get controllerrevisions
NAME                    CONTROLLER               REVISION   AGE
prometheus-69b9c88f94   StatefulSet/prometheus   1          0s
prometheus-69b9c88f95   StatefulSet/prometheus   1          1s
prometheus-69b9c88f96   StatefulSet/prometheus   1          1s
prometheus-69b9c88f97   StatefulSet/prometheus   1          1s
prometheus-69b9c88f98   StatefulSet/prometheus   1          1s
prometheus-69b9c88f99   StatefulSet/prometheus   1          1s
prometheus-69b9c88f9b   StatefulSet/prometheus   1          1s
prometheus-69b9c88f9c   StatefulSet/prometheus   1          1s
prometheus-69b9c88f9d   StatefulSet/prometheus   1          1s
prometheus-69b9c88f9f   StatefulSet/prometheus   1          1s
prometheus-69d6966956   StatefulSet/prometheus   1          1h
prometheus-69d6966957   StatefulSet/prometheus   1          1h
prometheus-69d6966958   StatefulSet/prometheus   1          1h
prometheus-69d6966959   StatefulSet/prometheus   1          1h
prometheus-69d696695b   StatefulSet/prometheus   1          1s
prometheus-69d696695c   StatefulSet/prometheus   1          1s
prometheus-69d696695d   StatefulSet/prometheus   1          1h
prometheus-69d696695f   StatefulSet/prometheus   1          5m
prometheus-69d6966966   StatefulSet/prometheus   1          1s
prometheus-69d6966967   StatefulSet/prometheus   1          1s
prometheus-86595ffcb7   StatefulSet/prometheus   1          1h
prometheus-86595ffcbb   StatefulSet/prometheus   1          1h
prometheus-86595ffcbc   StatefulSet/prometheus   1          1h
prometheus-86595ffcc8   StatefulSet/prometheus   1          1h
prometheus-86595ffcc9   StatefulSet/prometheus   1          1h

Terrifying. Deleting the revisions got me back to a working state.

smarterclayton · 2017-11-22T23:26:11Z

Happens on restart. Create stateful set, then update. Have two revisions. Restart the master process and it goes crazy:

○ oc get controllerrevisions
NAME                    CONTROLLER               REVISION   AGE
prometheus-69b9c88f95   StatefulSet/prometheus   1          9m
prometheus-6b67d68cd5   StatefulSet/prometheus   2          7m

Restart, and the list starts growing up to 12 revisions, all the same:

○ oc get controllerrevisions
NAME                    CONTROLLER               REVISION   AGE
prometheus-69b9c88f95   StatefulSet/prometheus   1          9m
prometheus-6b4c48c5d8   StatefulSet/prometheus   1          0s
prometheus-6b4c48c5d9   StatefulSet/prometheus   1          0s
prometheus-6b4c48c5db   StatefulSet/prometheus   1          0s
prometheus-6b4c48c5dc   StatefulSet/prometheus   1          0s
prometheus-6b67d68cd4   StatefulSet/prometheus   1          0s
prometheus-6b67d68cd5   StatefulSet/prometheus   2          7m
○ oc get controllerrevisions
NAME                    CONTROLLER               REVISION   AGE
prometheus-696759f598   StatefulSet/prometheus   1          2s
prometheus-696759f599   StatefulSet/prometheus   1          2s
prometheus-69b9c88f95   StatefulSet/prometheus   1          9m
prometheus-6b4c48c5d8   StatefulSet/prometheus   1          3s
prometheus-6b4c48c5d9   StatefulSet/prometheus   1          3s
prometheus-6b4c48c5db   StatefulSet/prometheus   1          3s
prometheus-6b4c48c5df   StatefulSet/prometheus   1          3s
prometheus-6b4c48c5f4   StatefulSet/prometheus   1          3s
prometheus-6b4c48c5f5   StatefulSet/prometheus   1          3s
prometheus-6b4c48c5f9   StatefulSet/prometheus   1          2s
prometheus-6b67d68cd4   StatefulSet/prometheus   1          3s
prometheus-6b67d68cd5   StatefulSet/prometheus   2          7m
○ oc get controllerrevisions
NAME                    CONTROLLER               REVISION   AGE
prometheus-696759f598   StatefulSet/prometheus   1          3s
prometheus-696759f599   StatefulSet/prometheus   1          3s
prometheus-69b9c88f95   StatefulSet/prometheus   1          9m
prometheus-6b4c48c5d8   StatefulSet/prometheus   1          4s
prometheus-6b4c48c5d9   StatefulSet/prometheus   1          4s
prometheus-6b4c48c5db   StatefulSet/prometheus   1          4s
prometheus-6b4c48c5df   StatefulSet/prometheus   1          4s
prometheus-6b4c48c5f4   StatefulSet/prometheus   1          4s
prometheus-6b4c48c5f5   StatefulSet/prometheus   1          4s
prometheus-6b4c48c5f9   StatefulSet/prometheus   1          3s
prometheus-6b67d68cd4   StatefulSet/prometheus   1          4s
prometheus-6b67d68cd5   StatefulSet/prometheus   2          7m

smarterclayton · 2017-11-22T23:28:27Z

Diff between two random revisions numbered 1:

○ diff <(oc get controllerrevisions prometheus-696759f599 -o yaml) <(oc get controllerrevisions prometheus-6b4c48c5f4 -o yaml)
171c171
<   creationTimestamp: 2017-11-22T23:25:16Z
---
>   creationTimestamp: 2017-11-22T23:25:15Z
174,175c174,175
<     controller.kubernetes.io/hash: "2607047195"
<   name: prometheus-696759f599
---
>     controller.kubernetes.io/hash: "2607047185"
>   name: prometheus-6b4c48c5f4
184,186c184,186
<   resourceVersion: "65759"
<   selfLink: /apis/apps/v1beta1/namespaces/kube-system/controllerrevisions/prometheus-696759f599
<   uid: 658f97e2-cfdc-11e7-b678-080027893417
---
>   resourceVersion: "65745"
>   selfLink: /apis/apps/v1beta1/namespaces/kube-system/controllerrevisions/prometheus-6b4c48c5f4
>   uid: 656561b6-cfdc-11e7-b678-080027893417

Something about hash calculation is wrong. Maybe unstable ordering of an underlying map when calculating the hash?
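The diff shows two revisions that differ only in metadata, with hash values only 10 apart. One way that could happen without any map-ordering instability is if something other than the revision content is mixed into the hash. Here is a minimal Python sketch of that idea, not the controller's actual Go code. It assumes a 32-bit FNV-1 hash and a hypothetical `collision_count` appended to the input; `revision_hash`, `revision_name`, and the name format are illustrative only. FNV-1 is used because a trailing input byte then only perturbs the low bits of the result, which resembles the near-identical hashes above.

```python
from typing import Optional

FNV32_OFFSET = 0x811C9DC5
FNV32_PRIME = 0x01000193


def fnv1_32(data: bytes) -> int:
    """32-bit FNV-1: multiply by the prime first, then XOR in each byte."""
    h = FNV32_OFFSET
    for byte in data:
        h = (h * FNV32_PRIME) & 0xFFFFFFFF
        h ^= byte
    return h


def revision_hash(template: bytes, collision_count: Optional[int] = None) -> int:
    """Hash the revision content, optionally salted with a collision counter.

    Assumption: the counter is appended as decimal text. This is a model
    for illustration, not the encoding the real controller uses.
    """
    payload = template
    if collision_count is not None:
        payload += str(collision_count).encode()
    return fnv1_32(payload)


def revision_name(prefix: str, template: bytes,
                  collision_count: Optional[int] = None) -> str:
    """Hypothetical name format: <prefix>-<decimal hash>."""
    return f"{prefix}-{revision_hash(template, collision_count)}"


if __name__ == "__main__":
    template = b'{"spec": {"image": "prometheus"}}'
    for count in range(3):
        # Same content, different counter: a new name every time.
        print(revision_name("prometheus", template, count))
```

With identical content, each counter value yields a different name, so a loop that keeps bumping the counter would keep creating revisions that look the same when diffed.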

tnozicka · 2017-11-24T12:52:52Z

I could reproduce it even by just creating the StatefulSet and restarting master. I suspect this is caused by some rogue retry and collision avoidance for hash. Looking...

tnozicka · 2017-11-24T16:07:16Z

@smarterclayton this is broken upstream as well.

I think newRevision is no longer re-entrant because of kubernetes/kubernetes#50490.
That revealed that we don't wait for the ControllerRevision informer to sync when starting the StatefulSet controller (which I am fixing now).
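For readers unfamiliar with the pattern, the startup fix amounts to blocking the controller until every cache it reads from has been populated. Below is a small Python model of that idea, not the actual Go change. `FakeInformer`, `wait_for_cache_sync`, and `start_controller` are hypothetical names made up for this sketch, and the polling approach is an assumption.

```python
import time
from typing import Callable, Iterable


def wait_for_cache_sync(has_synced: Iterable[Callable[[], bool]],
                        timeout: float = 5.0,
                        poll_interval: float = 0.01) -> bool:
    """Poll until every check reports synced; return False on timeout."""
    checks = list(has_synced)
    deadline = time.monotonic() + timeout
    while True:
        if all(check() for check in checks):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


class FakeInformer:
    """Hypothetical informer whose cache fills after a number of polls."""

    def __init__(self, polls_until_synced: int):
        self._remaining = polls_until_synced

    def has_synced(self) -> bool:
        if self._remaining > 0:
            self._remaining -= 1
            return False
        return True


def start_controller(informers, timeout: float = 1.0) -> str:
    # Every informer the sync loop reads from must be listed here.
    # Leaving one out means the loop can run against an empty cache
    # and conclude that nothing exists yet.
    if not wait_for_cache_sync([i.has_synced for i in informers], timeout=timeout):
        raise RuntimeError("timed out waiting for caches to sync")
    return "running"


if __name__ == "__main__":
    pods = FakeInformer(polls_until_synced=0)
    revisions = FakeInformer(polls_until_synced=3)
    print(start_controller([pods, revisions]))
```

In this model the bug corresponds to passing only `pods` to `start_controller`: the controller would start while the revision cache is still empty.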

I think we will need to add expectations, although it seems to be working now, or revert the collision avoidance PR. I suspect it will not work for rollback but I need to check it next week.

smarterclayton · 2017-11-24T16:55:38Z

Can you make sure there is a high severity Kube issue blocking/impacting the 1.9 release? We need to at least triage it. Thanks.

tnozicka · 2017-11-24T21:18:13Z

@smarterclayton I have labeled both the issue and the PR, but I'm not sure how to make it impact the 1.9 release because I can't set the milestone.

kow3ns · 2017-11-25T18:29:43Z

I set the milestone and added approval for the 1.9 milestone

@smarterclayton

…mersync Automatic merge from submit-queue. UPSTREAM: 56356: Wait for controllerrevision informer to sync on statefulset controller startup /cc @smarterclayton @mfojtik fixes #17435

smarterclayton added priority/P1 sig/master labels Nov 22, 2017

mfojtik assigned tnozicka Nov 23, 2017

tnozicka added the component/apps label Nov 24, 2017

tnozicka added this to the 3.8.0 milestone Nov 24, 2017

tnozicka mentioned this issue Nov 24, 2017

StatefulSet creates multiple controllerrevisions on controller restart kubernetes/kubernetes#56355

Closed

tnozicka mentioned this issue Dec 5, 2017

UPSTREAM: 56356: Wait for controllerrevision informer to sync on statefulset controller startup #17513

Merged

openshift-merge-robot closed this as completed in #17513 Dec 5, 2017

jrife mentioned this issue Jan 16, 2018

Kubernetes Keeps Restarting Pods of StatefulSet With “Need to kill pod” As The Only Indication Why kubernetes/kubernetes#58347

Closed
