Statefulset creates and deletes pod repeatedly, race condition or other error? #17435

Closed
smarterclayton opened this issue Nov 22, 2017 · 9 comments · Fixed by #17513

Comments

@smarterclayton
Contributor

In master:

I1122 23:10:02.330380   98410 event.go:218] Event(v1.ObjectReference{Kind:"StatefulSet", Namespace:"kube-system", Name:"prometheus", UID:"a7de8d06-cfce-11e7-9fbc-080027893417", APIVersion:"apps", ResourceVersion:"64043", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' create Pod prometheus-0 in StatefulSet prometheus successful
I1122 23:10:02.337665   98410 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"prometheus-0", UID:"44dd76cc-cfda-11e7-9a7c-080027893417", APIVersion:"v1", ResourceVersion:"64057", FieldPath:""}): type: 'Normal' reason: 'Scheduled' Successfully assigned prometheus-0 to localhost.localdomain
I1122 23:10:02.357457   98410 event.go:218] Event(v1.ObjectReference{Kind:"StatefulSet", Namespace:"kube-system", Name:"prometheus", UID:"a7de8d06-cfce-11e7-9fbc-080027893417", APIVersion:"apps", ResourceVersion:"64062", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' delete Pod prometheus-0 in StatefulSet prometheus successful
E1122 23:10:02.364507   98410 stateful_set.go:395] Error syncing StatefulSet kube-system/prometheus, requeuing: StatefulSet.apps "prometheus" is invalid: status.currentReplicas: Invalid value: -1: must be greater than or equal to 0

Not sure if this is a new post-rebase bug, a known issue already fixed upstream, or something else that is broken. This could potentially be very serious in a 3.7->3.8->3.9 fast rolling update.

@openshift/sig-master

@smarterclayton
Contributor Author

I was able to create it and it worked fine for a while, but after updating the stateful set it began doing this. It might be an error with controller history? No other obvious errors in logs at v=2.

Scenario:

  1. create prometheus example on master
  2. wait until it's up, view it
  3. edit the stateful set

Actual: infinite loop of create and delete.

Could be related to image stream tag resolution? Needs master team to look at (nothing explicit about image stream resolution in place).

@smarterclayton
Contributor Author

smarterclayton commented Nov 22, 2017

$ oc get controllerrevisions
NAME                    CONTROLLER               REVISION   AGE
prometheus-69b9c88f94   StatefulSet/prometheus   1          0s
prometheus-69b9c88f95   StatefulSet/prometheus   1          1s
prometheus-69b9c88f96   StatefulSet/prometheus   1          1s
prometheus-69b9c88f97   StatefulSet/prometheus   1          1s
prometheus-69b9c88f98   StatefulSet/prometheus   1          1s
prometheus-69b9c88f99   StatefulSet/prometheus   1          1s
prometheus-69b9c88f9b   StatefulSet/prometheus   1          1s
prometheus-69b9c88f9c   StatefulSet/prometheus   1          1s
prometheus-69b9c88f9d   StatefulSet/prometheus   1          1s
prometheus-69b9c88f9f   StatefulSet/prometheus   1          1s
prometheus-69d6966956   StatefulSet/prometheus   1          1h
prometheus-69d6966957   StatefulSet/prometheus   1          1h
prometheus-69d6966958   StatefulSet/prometheus   1          1h
prometheus-69d6966959   StatefulSet/prometheus   1          1h
prometheus-69d696695b   StatefulSet/prometheus   1          1s
prometheus-69d696695c   StatefulSet/prometheus   1          1s
prometheus-69d696695d   StatefulSet/prometheus   1          1h
prometheus-69d696695f   StatefulSet/prometheus   1          5m
prometheus-69d6966966   StatefulSet/prometheus   1          1s
prometheus-69d6966967   StatefulSet/prometheus   1          1s
prometheus-86595ffcb7   StatefulSet/prometheus   1          1h
prometheus-86595ffcbb   StatefulSet/prometheus   1          1h
prometheus-86595ffcbc   StatefulSet/prometheus   1          1h
prometheus-86595ffcc8   StatefulSet/prometheus   1          1h
prometheus-86595ffcc9   StatefulSet/prometheus   1          1h

Terrifying. Deleting the revisions got me back to a working state.

@smarterclayton
Contributor Author

smarterclayton commented Nov 22, 2017

Happens on restart. Create stateful set, then update. Have two revisions. Restart the master process and it goes crazy:

○ oc get controllerrevisions
NAME                    CONTROLLER               REVISION   AGE
prometheus-69b9c88f95   StatefulSet/prometheus   1          9m
prometheus-6b67d68cd5   StatefulSet/prometheus   2          7m

After the restart the list starts growing, up to 12 revisions, all the same:

○ oc get controllerrevisions
NAME                    CONTROLLER               REVISION   AGE
prometheus-69b9c88f95   StatefulSet/prometheus   1          9m
prometheus-6b4c48c5d8   StatefulSet/prometheus   1          0s
prometheus-6b4c48c5d9   StatefulSet/prometheus   1          0s
prometheus-6b4c48c5db   StatefulSet/prometheus   1          0s
prometheus-6b4c48c5dc   StatefulSet/prometheus   1          0s
prometheus-6b67d68cd4   StatefulSet/prometheus   1          0s
prometheus-6b67d68cd5   StatefulSet/prometheus   2          7m
○ oc get controllerrevisions
NAME                    CONTROLLER               REVISION   AGE
prometheus-696759f598   StatefulSet/prometheus   1          2s
prometheus-696759f599   StatefulSet/prometheus   1          2s
prometheus-69b9c88f95   StatefulSet/prometheus   1          9m
prometheus-6b4c48c5d8   StatefulSet/prometheus   1          3s
prometheus-6b4c48c5d9   StatefulSet/prometheus   1          3s
prometheus-6b4c48c5db   StatefulSet/prometheus   1          3s
prometheus-6b4c48c5df   StatefulSet/prometheus   1          3s
prometheus-6b4c48c5f4   StatefulSet/prometheus   1          3s
prometheus-6b4c48c5f5   StatefulSet/prometheus   1          3s
prometheus-6b4c48c5f9   StatefulSet/prometheus   1          2s
prometheus-6b67d68cd4   StatefulSet/prometheus   1          3s
prometheus-6b67d68cd5   StatefulSet/prometheus   2          7m
○ oc get controllerrevisions
NAME                    CONTROLLER               REVISION   AGE
prometheus-696759f598   StatefulSet/prometheus   1          3s
prometheus-696759f599   StatefulSet/prometheus   1          3s
prometheus-69b9c88f95   StatefulSet/prometheus   1          9m
prometheus-6b4c48c5d8   StatefulSet/prometheus   1          4s
prometheus-6b4c48c5d9   StatefulSet/prometheus   1          4s
prometheus-6b4c48c5db   StatefulSet/prometheus   1          4s
prometheus-6b4c48c5df   StatefulSet/prometheus   1          4s
prometheus-6b4c48c5f4   StatefulSet/prometheus   1          4s
prometheus-6b4c48c5f5   StatefulSet/prometheus   1          4s
prometheus-6b4c48c5f9   StatefulSet/prometheus   1          3s
prometheus-6b67d68cd4   StatefulSet/prometheus   1          4s
prometheus-6b67d68cd5   StatefulSet/prometheus   2          7m

@smarterclayton
Contributor Author

Diff between two random entries that both report revision 1:

○ diff <(oc get controllerrevisions prometheus-696759f599 -o yaml) <(oc get controllerrevisions prometheus-6b4c48c5f4 -o yaml)
171c171
<   creationTimestamp: 2017-11-22T23:25:16Z
---
>   creationTimestamp: 2017-11-22T23:25:15Z
174,175c174,175
<     controller.kubernetes.io/hash: "2607047195"
<   name: prometheus-696759f599
---
>     controller.kubernetes.io/hash: "2607047185"
>   name: prometheus-6b4c48c5f4
184,186c184,186
<   resourceVersion: "65759"
<   selfLink: /apis/apps/v1beta1/namespaces/kube-system/controllerrevisions/prometheus-696759f599
<   uid: 658f97e2-cfdc-11e7-b678-080027893417
---
>   resourceVersion: "65745"
>   selfLink: /apis/apps/v1beta1/namespaces/kube-system/controllerrevisions/prometheus-6b4c48c5f4
>   uid: 656561b6-cfdc-11e7-b678-080027893417

Something about hash calculation is wrong. Maybe unstable ordering of an underlying map when calculating the hash?
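One way identical revision content can end up with different hashes and names is a collision counter mixed into the hash, which is the direction the discussion later in this thread takes (kubernetes/kubernetes#50490). The following is a minimal sketch of that idea only. `hashRevision`, the sample payload, and the name format are hypothetical and are not taken from the Kubernetes source.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strconv"
)

// hashRevision is a hypothetical helper, not the Kubernetes implementation.
// It hashes the revision payload and, when a collision count is supplied,
// also mixes the count into the hash.
func hashRevision(data []byte, collisionCount *int32) uint32 {
	h := fnv.New32()
	h.Write(data)
	if collisionCount != nil {
		h.Write([]byte(strconv.FormatInt(int64(*collisionCount), 10)))
	}
	return h.Sum32()
}

func main() {
	// Made-up payload standing in for the serialized pod template.
	data := []byte(`{"spec":{"template":"prometheus"}}`)
	for i := int32(0); i < 3; i++ {
		count := i
		// Same data, different count: a different hash and therefore a different name.
		fmt.Printf("collisionCount=%d name=prometheus-%x\n", count, hashRevision(data, &count))
	}
}
```

Under this assumption the payload bytes are byte-for-byte identical across revisions, which would be consistent with the diff above showing differences only in the hash label, name, and server-assigned metadata.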

@tnozicka
Contributor

I could reproduce it even by just creating the StatefulSet and restarting master. I suspect this is caused by some rogue retry and collision avoidance for the hash. Looking...

@tnozicka tnozicka added this to the 3.8.0 milestone Nov 24, 2017
@tnozicka
Contributor

tnozicka commented Nov 24, 2017

@smarterclayton this is broken upstream as well.

I think newRevision is no longer re-entrant because of kubernetes/kubernetes#50490.
That revealed the fact that we don't wait for the revision informer when starting the StatefulSet controller (which I am fixing now).

I think we will need to add expectations, although it seems to be working now, or revert the collision avoidance PR. I suspect it will not work for rollback but I need to check it next week.

@smarterclayton
Copy link
Contributor Author

smarterclayton commented Nov 24, 2017 via email

@tnozicka
Contributor

tnozicka commented Nov 24, 2017

@smarterclayton I have labeled both the issue and the PR, but I'm not sure how to get it into the 1.9 release because I can't set the milestone.

@kow3ns

kow3ns commented Nov 25, 2017

I set the milestone and added approval for the 1.9 milestone
