
Use shared informer in BuildPodController and BuildPodDeleteController #13510

Merged: 1 commit into openshift:master on Apr 4, 2017

Conversation

@csrwng (Author) commented Mar 22, 2017

Changes:

  • Creates a new controller under a buildpod package that follows
    conventions in deployment controllers.
  • Refactors common controller functions into a new controller/common
    package for use by existing and new controllers.
  • Adds informer and lister for builds
  • Removes previous controller code.

@csrwng (Author) commented Mar 22, 2017

[test]

@csrwng (Author) commented Mar 22, 2017

@bparees ptal

@bparees (Contributor) left a review:

Why is there still a podDeleteLW that is no longer used?
And how is the logic that podDeleteLW implemented now managed by the shared informers? (Specifically, podDeleteListWatcher will, on List, create fake pod delete events for missing build pods, as part of the reconciliation logic.)

// NewBuildPodController creates a new BuildPodController.
func NewBuildPodController(buildInformer, podInformer cache.SharedIndexInformer, kc kclientset.Interface, oc osclient.Interface) *BuildPodController {
buildUpdater := buildclient.NewOSClientBuildClient(oc)
buildLister := buildclient.NewOSClientBuildClient(oc)
Contributor:

is there a reason we can't use the same object for both of these?

Contributor Author:

no reason, will collapse

obj, exists, err := bc.podStore.Indexer.GetByKey(key)
if err != nil {
glog.Infof("Unable to retrieve pod %q from store: %v", key, err)
bc.queue.Add(key)
Contributor:

is this going to hot loop or does the queue have some rate-limiting built into it?

Contributor:

You are not adding with a rate limit here though

Contributor:

Oh by the way the only error you can get from the cache is a NotFound, fyi

Contributor Author:

Thx, yes it should be AddRateLimited
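
For readers following the thread, a minimal sketch of the fix being agreed to here, assuming illustrative names (podLookup, getPodByKey) rather than the PR's exact code; import paths are shown as in current client-go, while the PR's vintage vendors them under k8s.io/kubernetes:

```go
package buildpod

import (
	"github.com/golang/glog"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// Illustrative fields; not the PR's exact types.
type podLookup struct {
	podStore cache.Indexer
	queue    workqueue.RateLimitingInterface
}

// getPodByKey requeues with the queue's per-key backoff (AddRateLimited)
// instead of an unconditional Add, so a persistent store error cannot
// hot-loop the worker.
func (l *podLookup) getPodByKey(key string) (interface{}, bool, error) {
	obj, exists, err := l.podStore.GetByKey(key)
	if err != nil {
		glog.Infof("Unable to retrieve pod %q from store: %v", key, err)
		l.queue.AddRateLimited(key) // was: queue.Add(key)
		return nil, false, err
	}
	return obj, exists, nil
}
```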

return nil, err
}
if !exists {
glog.Infof("Pod %q has been deleted", key)
Contributor:

can we really be confident of this? there may be cases where the pod cache is out of date and doesn't yet contain the build pod.

Contributor Author:

we wouldn't have the key in the queue

Contributor:

yeah ok i was thinking of a different code flow where we get a build event, look up the buildpod annotation value, and then try to retrieve the build pod by that key. In that case, it's a problem to query a cache for the build pod because it could result in a cache miss and a false result that the pod doesn't exist when it does.

(and the inverse case, where we get a pod event, look up the build associated w/ the pod and, again, could have a cache miss if the cache is stale)

@@ -0,0 +1,213 @@
package buildpod
Contributor:

not obvious to me why we have a "factory" here, seems like all this content should just be moved into controller.go.

Contributor Author:

to keep the object/event processing code separate from the setup/creation code. We don't need to call it factory ... maybe infra.go ? The deployment controllers still call it factory, which is why I reused the name.

Contributor:

I'm not convinced separating that is necessary or even desirable. I generally like it when all my object methods are in the same file as the object definition itself. We don't create separate files for all our other New*() functions.

Contributor Author:

It is common in go to have methods of a single struct span multiple files.
But I don't feel strongly one way or the other, so will collapse.

Contributor:

It is common in go to have methods of a single struct span multiple files.

and maybe someday someone will convince me that's a benefit :)
(showing my java predilection, i know)

return
}

if bc.queue.NumRequeues(key) < maxRetries {
Contributor:

when does NumRequeues get reset? we could see the same key multiple times.

Contributor:

It gets reset when we process it successfully

if !isBuildPod(pod) {
return nil, false, errors.New("cannot get build for pod: pod is not a build pod")
}
build, err := bc.buildStore.Builds(pod.Namespace).Get(buildutil.GetBuildName(pod))
Contributor:

assuming this is also a cache, same question here about cases where the cache is stale and the build really does exist despite not being in the cache yet.

Contributor Author (@csrwng, Mar 22, 2017):

That's a good question ... today we simply say it's not found and keep on going:
https://github.com/openshift/origin/blob/master/pkg/build/controller/controller.go#L295-L298

It's hard to imagine a situation where we're seeing the pod, but the build is not available, given that the build had to be in a cache first to generate the pod. It's true that until we use the same shared informers across all controllers, these are going to be different caches.

Contributor:

so the concern here would be if someone deletes a pod for a running build and we go to look up the build for that pod and can't find it, we will not properly mark the build as "failed/error: pod was deleted"

as you say, probably unlikely in this particular flow since we have to have seen the build at some point to have created the pod, so maybe we can live with it, but for all these cases where we are looking up other resources in caches we need to think about the implications of a stale cache/race. I really wish we had a read-through cache option that would hit the api server if the item isn't found in the cache.

Contributor Author:

so the concern here would be if someone deletes a pod for a running build and we go to look up the build for that pod and can't find it, we will not properly mark the build as "failed/error: pod was deleted"

Yes we will because we will handle the pod deleted event

Contributor:

Yes we will because we will handle the pod deleted event

?

handling the pod deleted event properly requires that we are able to find the build associated w/ the pod that we got a delete event for. If we can't find the build because it's not in the cache yet (and we don't check the api server directly), we can't handle the event properly, because handling the event properly requires updating the build object.

again, i think you've convinced me it's not possible for the build not to be in the cache in this particular case, but the general class of problem is real.

Contributor:

We collapsed the deployer pod controller into the deployments controller and moved the pod lookup to be a cache lookup w/o any major problem. The change was 1.3 or 1.4, can't remember.

@csrwng changed the title from "Used shared informer in BuildPodController and BuildPodDeleteController" to "Use shared informer in BuildPodController and BuildPodDeleteController" on Mar 22, 2017
@csrwng (Author) commented Mar 22, 2017

The delete controller has been replaced by the event listener on the informer:
https://github.com/openshift/origin/pull/13510/files#diff-8ffc04560dc75409e4c5a57523e707dbR48
which is the same thing the deployment controller uses to process deletion events.

I will remove the podDeleteLW struct; I forgot to remove it.
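
For context, roughly what "the event listener on the informer" looks like; a hedged sketch with an illustrative function name (registerPodDeleteHandler), not the PR's exact code, and with import paths shown as in current client-go (the PR's vintage vendors them under k8s.io/kubernetes). Note the tombstone handling: a DeleteFunc can receive a cache.DeletedFinalStateUnknown instead of the object itself.

```go
package buildpod

import (
	"fmt"

	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// registerPodDeleteHandler sketches the informer-based replacement for the
// old BuildPodDeleteController: pod deletions arrive as informer events and
// are pushed onto the controller's work queue.
func registerPodDeleteHandler(podInformer cache.SharedIndexInformer, queue workqueue.Interface) {
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			// A delete observed only via a relist is delivered as a tombstone
			// whose final state is unknown; unwrap it before computing the key.
			if tombstone, ok := obj.(cache.DeletedFinalStateUnknown); ok {
				obj = tombstone.Obj
			}
			key, err := cache.MetaNamespaceKeyFunc(obj)
			if err != nil {
				utilruntime.HandleError(fmt.Errorf("couldn't get key for deleted object %#v: %v", obj, err))
				return
			}
			queue.Add(key)
		},
	})
}
```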

@bparees (Contributor) commented Mar 22, 2017

The delete controller has been replaced by the event listener on the informer:
https://github.com/openshift/origin/pull/13510/files#diff-8ffc04560dc75409e4c5a57523e707dbR48
Which is the same thing deployment uses to process deletion events.

how does that handle reconciliation? (ie what happens if we miss the delete event? what resends it?)

@csrwng (Author) commented Mar 22, 2017

how does that handle reconciliation? (ie what happens if we miss the delete event? what resends it?)

The shared informer is using the same DeltaFIFO we were using under the covers:
https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/client/cache/shared_informer.go#L164

@bparees (Contributor) commented Mar 22, 2017

The shared informer is using the same deltafifo we were using under the covers

that doesn't handle reconciliation (it really can't).

this is how we handle reconciliation:
https://github.com/openshift/origin/blob/master/pkg/build/controller/factory/factory.go#L575-L613

(we go through all the builds, find the pods that are supposed to be associated with them, check if the pod actually exists, and if it doesn't exist, we manufacture a pod deletion event so that the HandlePodDeletion logic will get called)
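
A minimal sketch of that reconciliation pass, under stated assumptions: the buildInfo/buildLister/podLister types and reconcileMissingBuildPods name are hypothetical stand-ins for the real listers in pkg/build/controller/factory/factory.go, not code from the PR.

```go
package buildpod

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

// Hypothetical lister shapes for the sketch.
type buildInfo struct {
	Namespace, Name, PodName string
	Terminal                 bool
}

type buildLister interface {
	ListBuilds() ([]buildInfo, error)
}

type podLister interface {
	PodExists(namespace, name string) (bool, error)
}

// reconcileMissingBuildPods walks the non-terminal builds and, for any build
// whose pod is missing, enqueues a synthetic "pod deleted" work item so the
// normal delete-handling path runs even if the original watch event was
// missed (for example because the controller was down at the time).
func reconcileMissingBuildPods(builds buildLister, pods podLister, queue workqueue.Interface) error {
	all, err := builds.ListBuilds()
	if err != nil {
		return err
	}
	for _, b := range all {
		if b.Terminal {
			continue // completed or failed builds no longer need a pod
		}
		exists, err := pods.PodExists(b.Namespace, b.PodName)
		if err != nil {
			return err
		}
		if !exists {
			// The worker treats a cache miss for this key as a deletion.
			queue.Add(fmt.Sprintf("%s/%s", b.Namespace, b.PodName))
		}
	}
	return nil
}
```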

@csrwng (Author) commented Mar 22, 2017

Of course, you want to make me dig through more code :)

Let me take a look and see if we still need that logic. I know that nothing like that is done in the deployment controllers, but they do a lot of things differently than builds.

@bparees (Contributor) commented Mar 22, 2017

Let me take a look and see if we still need that logic. I know that nothing like that is done in the deployment controllers, but they do a lot of things differently than builds.

i'm willing to bet that either

  1. they don't handle the scenario
    or
  2. for some reason it's not applicable to deployments

fundamentally we need to be able to deal w/ the possibility that our controller was down when the pod deletion event fired. when we come up, nothing is going to send us that event (the object is gone, there's no history of it, etc), so we need to somehow determine "hey, there's a pod missing, it must have been deleted, we need to do something about that"

go wait.Until(c.worker, time.Second, stopCh)
<-stopCh
glog.Infof("Shutting down build pod controller")
c.queue.ShutDown()
Contributor:

defer this

Contributor:

For historical reasons, we should update those controllers to conform to the latest trends. We defer everywhere upstream.

Contributor Author:

thx will update


// Wait for the build store to sync before starting any work in this controller.
ready := make(chan struct{})
go c.waitForSyncedStores(ready, stopCh)
Contributor:

There is some machinery coming up that handles some of this stuff - you may want to look at how this is implemented in upstream controllers, eg. https://github.com/kubernetes/kubernetes/blob/3705358c59dc42e7cdbc2c942845a70547076d2b/pkg/controller/deployment/deployment_controller.go#L149

Contributor Author:

Thx, will switch it to do the same
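
A sketch of the upstream-style startup being pointed to, covering both of the comments above (the deferred queue shutdown and cache.WaitForCacheSync replacing the hand-rolled waitForSyncedStores polling). Field and method names are illustrative, and import paths are shown as in current client-go rather than the PR's vendored tree.

```go
package buildpod

import (
	"fmt"
	"time"

	"github.com/golang/glog"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// Illustrative fields; not the PR's exact types.
type buildPodController struct {
	queue        workqueue.RateLimitingInterface
	buildsSynced cache.InformerSynced
	podsSynced   cache.InformerSynced
}

func (c *buildPodController) runWorker() {
	// pop keys off c.queue until it is shut down
}

// Run defers the queue shutdown and waits for the informer caches to sync
// before starting any workers.
func (c *buildPodController) Run(workers int, stopCh <-chan struct{}) {
	defer utilruntime.HandleCrash()
	defer c.queue.ShutDown() // deferred, per the review comment above

	glog.Infof("Starting build pod controller")
	if !cache.WaitForCacheSync(stopCh, c.buildsSynced, c.podsSynced) {
		utilruntime.HandleError(fmt.Errorf("timed out waiting for caches to sync"))
		return
	}

	for i := 0; i < workers; i++ {
		go wait.Until(c.runWorker, time.Second, stopCh)
	}

	<-stopCh
	glog.Infof("Shutting down build pod controller")
}
```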

// We must avoid processing build pods until the build and pod stores have synced.
// If they haven't synced, to avoid a hot loop, we'll wait this long between checks.
storeSyncedPollPeriod = 100 * time.Millisecond
maxRetries = 60
Contributor:

Any particular reason behind this number? I would say if you need to retry something, 10 at most would be enough.

Contributor Author:

This is what we have in our original controller code... @bparees ?

Contributor:

we originally coded it to retry for something like a minute (the rate limiter we used effectively limited you to one retry per second after an initial burst). This primarily is about retrying things where we are waiting on some resource to appear (eg an imagestream or something), not where the call is failing, so we want to give it enough time to potentially appear.

Contributor:

If you are depending on some other resource, you should handle events for it. For example, if you wait for a pod to appear so you can update a build, handling pod events should be enough.

Contributor:

I disagree that listening on all imagestreams in the cluster to see if the imagestream we need has appeared yet, then going back to figure out which build was waiting for that imagestream and prodding it forward (and, if we never get the event, at some point going back and failing the build), is a better solution than retrying on the build event a few times.

I know we've had this debate elsewhere but i stand by my view of that being unnecessary overhead+complexity, especially when 99% of the time the imagestream(or secret or whatever) is going to exist.

Contributor:

Also, if the object already exists with deletionTimestamp != nil but is not yet removed from etcd, you need a check in your ADD handlers that detects those and manually invokes the DELETE handling.

Contributor:

Once your cache is synced, everything that is missing from it, is deleted.
Contributor:

Once your cache is synced, everything that is missing from it, is deleted.

huh? how do i know something is missing from it? the item won't be in etcd and it won't be in the cache, because it was deleted while we were down and when we came up, we synced our cache from etcd (which again, does not contain the item)

Also if the object already exists with deletionTimestamp != nil but is not yet removed from etcd, you need to have a check in your ADD handlers that checks those and manually calls DELETE.

why? won't we still get a real delete event when the item is actually deleted from etcd? and of course if we had reconcile logic, we'd catch it then too.

Contributor:

how do i know something is missing from it?

You have a build A that depends on a pod A-pod. The build controller goes down, and A-pod gets deleted for some reason. The build controller comes up, reconciles all builds, and sees that A-pod does not exist for A. This also depends on the state of the Build; for example, you know the pod was deleted because the Build is Running. You probably want to recreate the Pod if the Build is New or Pending.

why? won't we still get a real delete event when the item is actually deleted from etcd? and of course if we had reconcile logic, we'd catch it then too.

I think you simply don't want to handle that event as an ADD; I'm not sure about the timing of when the DELETE is propagated from etcd (marked as deleted vs. really deleted). You also need the logic to exist in the reconcile loop, e.g. for avoiding actions like adoption.
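
A minimal sketch of the ADD-handler check being described, assuming a hypothetical podHandler type with stubbed enqueuePod/handleDeletedPod methods (none of these names are from the PR); the kapi import path matches the vendored tree of that era.

```go
package buildpod

import (
	kapi "k8s.io/kubernetes/pkg/api"
)

// Illustrative handler with stubbed paths; not the PR's types.
type podHandler struct{}

func (h *podHandler) enqueuePod(pod *kapi.Pod)       { /* normal add/update path */ }
func (h *podHandler) handleDeletedPod(pod *kapi.Pod) { /* pod-deleted path: fail the build, etc. */ }

// addPod routes an object that is already marked for deletion
// (DeletionTimestamp set, not yet removed from etcd) to the deletion path
// instead of treating it as a fresh pod.
func (h *podHandler) addPod(obj interface{}) {
	pod, ok := obj.(*kapi.Pod)
	if !ok {
		return
	}
	if pod.DeletionTimestamp != nil {
		h.handleDeletedPod(pod)
		return
	}
	h.enqueuePod(pod)
}
```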

Contributor:

btw today we don't even handle DELETEs for ImageStreams:

streamInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: c.addImageStream,
UpdateFunc: c.updateImageStream,
})
and I am not sure we need to.

@csrwng (Author) commented Mar 22, 2017

fundamentally we need to be able to deal w/ the possibility that our controller was down when the pod deletion event fired

Yup, it looks like I need to add a handler for build events and enqueue a pod-deleted event when I get a build event and there's no corresponding pod.

@openshift-bot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Mar 23, 2017
}

if strategy.IsFatal(err) {
glog.V(3).Infof("Will not retry fatal error for pod %s/%s: %v", pod.Namespace, pod.Name, err)
Contributor:

Maybe V(2)

return
}

glog.V(3).Infof("Giving up retrying pod %s/%s: %v", pod.Namespace, pod.Name, err)
Contributor:

Maybe V(2)


func (bc *BuildPodController) getBuildForPod(pod *kapi.Pod) (*buildapi.Build, bool, error) {
if !isBuildPod(pod) {
return nil, false, errors.New("cannot get build for pod: pod is not a build pod")
Contributor:

This check seems redundant since you only requeue build pods from the resource handlers. Can you at least print the name of the pod in the error?

@ncdc (Contributor) commented Mar 23, 2017

Note that once the 1.6 kube rebase lands, you'll be able to use generated listers and shared informers for our api groups (the new groups, not the legacy ones).

@csrwng force-pushed the build_shared_informer branch from 053a7dd to 432c2c0 on March 23, 2017 21:57
@csrwng (Author) commented Mar 23, 2017

@bparees @Kargakis addressed your comments so far

glog.V(2).Infof("Failing build %s/%s because the pod has no containers", build.Namespace, build.Name)
nextStatus = buildapi.BuildPhaseFailed
} else {
for _, info := range pod.Status.ContainerStatuses {
Contributor:

If the pod is marked as succeeded then all its containers succeeded, right?

Contributor Author:

This code was specifically added for a strange case where the pod was marked as succeeded but it had no containers.

buildListerUpdater := buildclient.NewOSClientBuildClient(oc)
c := &BuildPodController{
buildUpdater: buildListerUpdater,
secretClient: kc.Core(),
Contributor:

Maybe a TODO to switch this to a cache lookup

Contributor Author:

Thx, TODO added

@openshift-bot removed the needs-rebase label on Mar 23, 2017
return
}

go wait.Until(c.worker, time.Second, stopCh)
Contributor:

Running a single worker for all build pods in the cluster?

Contributor Author:

so far we have only run one... @bparees would you be ok with increasing to 5?

Contributor:

any risk of race conditions when we start processing events in parallel? especially around the serial build policy?

Contributor (@bparees, Mar 24, 2017):

(could have one thread processing a "pod is now running" event when another thread gets a "pod is complete" event and the second thread could finish its work first, for example?)

Contributor Author:

I can't think of any, but presumably we have the same risks today in an HA setup, no? And we test for that in integration when running multiple controllers in parallel.

Contributor:

The cache is thread-safe and when you process a key, nobody else can process it at the same time. Or at least that is my impression (not deeply familiar with the internals). cc: @deads2k
I am definitely not sure what happens in a HA setup but that is not related to having a single or multiple goroutines here.

Contributor:

The workqueue ensures that only one key with the value "X" is processed at a time.

Contributor:

@deads2k @csrwng w/o looking at the code: i assume that's based on us saying when we're "done" with a work item... does our controller code not mark an item "done" until we've finished all processing related to it, or are we marking it done as soon as we pop it off the queue?

Contributor:

@csrwng cool.
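
For readers following this sub-thread, a sketch of the standard workqueue worker contract being discussed: Get hands a key to exactly one goroutine at a time (which is what makes running several workers safe), Done is deferred until processing finishes rather than called when the key is popped, Forget resets the per-key retry count on success, and AddRateLimited requeues with backoff until NumRequeues exceeds maxRetries. Type and field names here are illustrative, not the PR's code.

```go
package buildpod

import (
	"github.com/golang/glog"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/client-go/util/workqueue"
)

const maxRetries = 60

// Illustrative worker wrapper; sync is the per-key reconcile function.
type queueWorker struct {
	queue workqueue.RateLimitingInterface
	sync  func(key string) error
}

func (w *queueWorker) processNextWorkItem() bool {
	key, quit := w.queue.Get()
	if quit {
		return false
	}
	// Done is only called after sync returns, so no other goroutine can
	// pick up this key while we are still working on it.
	defer w.queue.Done(key)

	err := w.sync(key.(string))
	if err == nil {
		w.queue.Forget(key) // success: resets NumRequeues for this key
		return true
	}

	if w.queue.NumRequeues(key) < maxRetries {
		glog.V(2).Infof("retrying %q: %v", key, err)
		w.queue.AddRateLimited(key)
		return true
	}

	// Too many failures: give up and drop the key's rate-limit history.
	utilruntime.HandleError(err)
	w.queue.Forget(key)
	return true
}
```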


pod, err := bc.getPodByKey(key.(string))
if err != nil {
glog.Error(err.Error())
Contributor:

You should replace all instances of glog.Errorf with utilruntime.HandleError (it's rate-limited and generally accepted over glog.Errorf)

Contributor Author:

thx, will do
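
A small illustration of the substitution being asked for, with a hypothetical helper name (handleLookupError) and the current client-go import path; the PR's vendor tree has the package under k8s.io/kubernetes/pkg/util/runtime.

```go
package buildpod

import (
	"fmt"

	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
)

// handleLookupError routes errors through the rate-limited HandleError
// helper instead of glog.Errorf.
func handleLookupError(key string, err error) {
	utilruntime.HandleError(fmt.Errorf("error processing pod %q: %v", key, err))
}
```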

@csrwng force-pushed the build_shared_informer branch from a921cb2 to f379fb0 on March 24, 2017 22:01
@csrwng (Author) commented Mar 24, 2017

addressed latest comments

build.Status.Message = ""
nextStatus = buildapi.BuildPhaseRunning

case kapi.PodPending:
Contributor:

nit: move Pending before Running

Contributor:

Order doesn't matter unless you're only talking about visual appearance

Contributor:

Sure, I just felt it's annoying not to handle these cases in their logical order. Anyway a nit

obj, exists, err := bc.podStore.Indexer.GetByKey(key)
if err != nil {
glog.Infof("Unable to retrieve pod %q from store: %v", key, err)
bc.queue.Add(key)
Contributor:

Needs to be ratelimited


func (bc *BuildPodController) addBuild(obj interface{}) {
build := obj.(*buildapi.Build)
bc.checkBuildPodDeletion(build)
Contributor:

When a build is created, its pod doesn't exist but we'll end up triggering a deletion event for the pod, right?

@csrwng force-pushed the build_shared_informer branch from f379fb0 to 7e4344f on March 25, 2017 15:17
@deads2k (Contributor) commented Mar 27, 2017

If you're not going to do that, you're just making life miserable for anyone who's going to try to build on top of such a system. I can't believe read-through on cache miss is not part of the design.

Huh? I suspect we're talking about two different caches now. I was talking about the server-side GET-cache which is known to give stale information during tests upstream.

I think you may be talking about a generated lister, which is different.

@bparees (Contributor) commented Mar 27, 2017

Huh? I suspect we're talking about two different caches now. I was talking about the server-side GET-cache which is known to give stale information during tests upstream.

i'm really talking about both the shared informer caches and this get-cache you're saying upstream is moving towards. As a client component, we need a way to say "does this object exist?" and be able to trust the answer when the system says "no, i have no record of that object".

What you and @Kargakis seem to be saying is "you can never trust that answer, you will never be able to trust that answer, and the best you can do is wait an infinite amount of time and if after the universe ends, the object still doesn't appear to exist in any caches, then maybe you can assume it has actually been deleted as opposed to just not having made it into a cache yet"

With that behavior, it is impossible for us to safely implement the functionality of "user deleted the build pod, we need to fail the build now", because if we miss the initial pod-delete event, we have no way to reconcile the state: we can never know for sure whether the pod has been deleted or simply hasn't made it into whatever cache we're being forced to query (or are indirectly querying, because the api server is serving cached values w/o doing a read-through on miss), meaning we will always be at risk of failing a build that we shouldn't have failed.

@deads2k (Contributor) commented Mar 27, 2017

because if we miss the initial pod-delete event,

I haven't read the code in detail. How would you miss a delete event?

If all you really want is a live read of a pod on a "not found", feel free to do it. It doesn't seem necessary to me, but it's not like it would be a common case.

@bparees (Contributor) commented Mar 27, 2017

I haven't read the code in detail. How would you miss a delete event?

controller gets restarted after being down/wedged when the delete occurred would be one way.

@0xmichalis (Contributor):

the best you can do is wait an infinite amount of time

This is inaccurate. Most of the time, the lag between etcd and the controller cache is minimal (seconds at most), and a couple of retries will suffice. You can always do a live GET once your retries are up, but it would rarely, if ever, be used, depending on the number of retries.

@0xmichalis (Contributor):

fwiw, here is the fix for the deployment controller: #13550
@mfojtik ^

@bparees (Contributor) commented Mar 27, 2017

This is inaccurate. Most of the time, the lag between etcd and the controller cache is minimal (seconds at most) and a couple of retries will suffice. You can always do a live GET once your retries are up but it would be rarely used if not at all depending on the amount of retries.

@deads2k indicated even the live GET may not be an actual get against etcd at some point in the future, due to caches being introduced upstream (and/or moving to a replicated etcd model w/o quorum reads enabled).

In @csrwng's testing, most of the time this doesn't happen at all (the cache is populated fast enough), so this isn't really about the "mainline" scenario; it's about handling the long-tail cases. So yes, we can retry the cache a few times before going to the raw GET (ignoring the aforementioned concerns that even a raw GET won't give us the truth), but from what we've seen so far it's not even worth retrying: we can just do the raw GET on the first cache miss, since it will be relatively infrequent (and the current code always does a raw GET today, and it hasn't shown up as a performance issue).

@csrwng force-pushed the build_shared_informer branch 2 times, most recently from 11389d2 to dec1b07, on March 29, 2017 19:06
@csrwng (Author) commented Mar 29, 2017

I added a commit to queue the 'pod not found' events and retry them using the cache. If the pod is still not found in the cache after 3 retries, it tries the API client directly. If it is still not found, it marks the build as failed.
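
A hedged sketch of that behaviour; the podCache/podClient interfaces and getPodReadThrough name are hypothetical stand-ins for the informer's indexer and the kube client, not the PR's code.

```go
package buildpod

import (
	"k8s.io/apimachinery/pkg/api/errors"
)

// Hypothetical lookup shapes for the sketch.
type podCache interface {
	GetPod(namespace, name string) (interface{}, bool, error)
}

type podClient interface {
	GetPodLive(namespace, name string) (interface{}, error)
}

// getPodReadThrough trusts the cache first (the worker requeues a "not
// found" a few times to give the cache time to catch up) and only after
// retries are exhausted falls back to a live GET against the API server
// before concluding the pod is really gone and the build should be failed.
func getPodReadThrough(store podCache, client podClient, namespace, name string, retriesExhausted bool) (interface{}, bool, error) {
	obj, exists, err := store.GetPod(namespace, name)
	if err != nil {
		return nil, false, err
	}
	if exists {
		return obj, true, nil
	}
	if !retriesExhausted {
		// Caller should requeue and try the cache again (up to ~3 times).
		return nil, false, nil
	}
	obj, err = client.GetPodLive(namespace, name)
	if errors.IsNotFound(err) {
		return nil, false, nil // definitively gone: mark the build failed
	}
	if err != nil {
		return nil, false, err
	}
	return obj, true, nil
}
```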

@bparees self-assigned this on Mar 31, 2017
@bparees (Contributor) left a review:

plus the open question about whether we need to handle pod ADD events.

return
}
if err := runPolicy.OnComplete(build); err != nil {
glog.Errorf("failed to run policy on completed build: %v", err)
Contributor:

should be calling HandleError on both of these errors.

// no containers in the pod means something went badly wrong, so the build
// should be failed.
glog.V(2).Infof("Failing build %s/%s because the pod has no containers", build.Namespace, build.Name)
nextStatus = buildapi.BuildPhaseFailed
Contributor:

we should be setting a build status reason/message here.


// The only error that can currently be returned is a NotFound error, but adding a check
// here just in case that changes in the future
if err != nil && errors.IsNotFound(err) {
Contributor:

please add some comments explaining why we're doing this (cache latency)

glog.V(2).Infof("Will not retry fatal error for pod %s/%s: %v", pod.Namespace, pod.Name, err)
bc.queue.Forget(key)
return
}
Contributor:

is there any way to throw a fatal error in this code? I'm not seeing one.

what conditions would we want to consider as fatal for handling a pod event?

Contributor Author:

At this point there are no fatal errors, but we will need this logic once we merge this with the build controller which does have a fatal error.

@openshift-bot added the needs-rebase label on Mar 31, 2017
@csrwng force-pushed the build_shared_informer branch from dec1b07 to ea67bc2 on April 3, 2017 21:32
Changes:
- Creates a new controller under a buildpod package that follows
conventions in deployment controllers.
- Refactors common controller functions into a new controller/common
package for use by existing and new controllers.
- Adds informer and lister for builds
- Removes previous controller code.
@csrwng force-pushed the build_shared_informer branch from ea67bc2 to 18036b7 on April 3, 2017 21:50
@csrwng (Author) commented Apr 3, 2017

latest comments addressed

@csrwng (Author) commented Apr 3, 2017

[testextended][extended:core(builds)]

@openshift-bot (Contributor):

Evaluated for origin testextended up to 18036b7

@openshift-bot (Contributor):

continuous-integration/openshift-jenkins/testextended SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended/116/) (Base Commit: 3b82ca6) (Extended Tests: core(builds))

@openshift-bot removed the needs-rebase label on Apr 4, 2017
@csrwng (Author) commented Apr 4, 2017

Flake #11114
[test]

@csrwng (Author) commented Apr 4, 2017

Flake #12236
[test]

@openshift-bot (Contributor):

Evaluated for origin test up to 18036b7

@openshift-bot (Contributor):

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/551/) (Base Commit: f0670df)

@bparees (Contributor) commented Apr 4, 2017

lgtm

@csrwng (Author) commented Apr 4, 2017

[merge]

@openshift-bot (Contributor):

Evaluated for origin merge up to 18036b7

@openshift-bot (Contributor) commented Apr 4, 2017

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/244/) (Base Commit: 6a175fa) (Image: devenv-rhel7_6115)

@openshift-bot merged commit 7c91f63 into openshift:master on Apr 4, 2017