Use shared informer in BuildPodController and BuildPodDeleteController #13510
Conversation
[test]
@bparees ptal
Why is there still a podDeleteLW if it is no longer used?
And how is the logic that podDeleteLW implemented now managed by the shared informers? (Specifically, podDeleteListWatcher would, on List, create fake pod delete events for missing build pods as part of the reconciliation logic.)
// NewBuildPodController creates a new BuildPodController.
func NewBuildPodController(buildInformer, podInformer cache.SharedIndexInformer, kc kclientset.Interface, oc osclient.Interface) *BuildPodController {
	buildUpdater := buildclient.NewOSClientBuildClient(oc)
	buildLister := buildclient.NewOSClientBuildClient(oc)
is there a reason we can't use the same object for both of these?
no reason, will collapse
obj, exists, err := bc.podStore.Indexer.GetByKey(key)
if err != nil {
	glog.Infof("Unable to retrieve pod %q from store: %v", key, err)
	bc.queue.Add(key)
is this going to hot loop or does the queue have some rate-limiting built into it?
the queue is a rate-limiting thing:
https://github.com/openshift/origin/pull/13510/files#diff-94fa474175901a5802dadd44c4f318caR28
You are not adding with a rate limit here though
Oh by the way the only error you can get from the cache is a NotFound, fyi
Thx, yes it should be AddRateLimited
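For reference, the agreed fix amounts to something like this (a minimal sketch, assuming the controller's queue is a rate-limiting workqueue as elsewhere in this PR):

```go
// back off between retries instead of hot-looping on a store error
bc.queue.AddRateLimited(key) // rather than bc.queue.Add(key), which requeues immediately
```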
	return nil, err
}
if !exists {
	glog.Infof("Pod %q has been deleted", key)
can we really be confident of this? there may be cases where the pod cache is out of date and doesn't yet contain the build pod.
we wouldn't have the key in the queue
yeah ok i was thinking of a different code flow where we get a build event, lookup the buildpod annotation value, and then try to retrieve the build pod by that key. In that case, it's a problem to query a cache for the build pod because it could result in a cache miss and a false result that the pod doesn't exist when it does.
(and the inverse case, where we get a pod event, look up the build associated w/ the pod and again, could have a cache miss if the cache is stale)
@@ -0,0 +1,213 @@
package buildpod |
not obvious to me why we have a "factory" here, seems like all this content should just be moved into controller.go.
to keep the object/event processing code separate from the setup/creation code. We don't need to call it factory ... maybe infra.go ? The deployment controllers still call it factory, which is why I reused the name.
I'm not convinced separating that is necessary or even desirable. I generally like it when all my object methods are in the same file as the object definition itself. We don't create separate files for all our other New*() functions.
It is common in go to have methods of a single struct span multiple files.
But I don't feel strongly one way or the other, so will collapse.
> It is common in go to have methods of a single struct span multiple files.
and maybe someday someone will convince me that's a benefit :)
(showing my java predilection, i know)
	return
}
if bc.queue.NumRequeues(key) < maxRetries {
when does NumRequeues get reset? we could see the same key multiple times.
It gets reset when we process it successfully
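Concretely, the bookkeeping follows the usual workqueue pattern; a rough sketch (the handleError name and maxRetries constant mirror this PR but are not the exact final code):

```go
// Forget clears the per-key requeue counter, so NumRequeues only grows while
// a key keeps failing back to back.
func (bc *BuildPodController) handleError(err error, key interface{}) {
	if err == nil {
		bc.queue.Forget(key) // success: reset NumRequeues for this key
		return
	}
	if bc.queue.NumRequeues(key) < maxRetries {
		bc.queue.AddRateLimited(key) // retry with backoff
		return
	}
	// too many consecutive failures: drop the key and its retry history
	bc.queue.Forget(key)
}
```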
if !isBuildPod(pod) {
	return nil, false, errors.New("cannot get build for pod: pod is not a build pod")
}
build, err := bc.buildStore.Builds(pod.Namespace).Get(buildutil.GetBuildName(pod))
assuming this is also a cache, same question here about cases where the cache is stale and the build really does exist despite not being in the cache yet.
That's a good question ... today we simply say it's not found and keep on going:
https://github.com/openshift/origin/blob/master/pkg/build/controller/controller.go#L295-L298
It's hard to imagine a situation where we're seeing the pod but the build is not available, given that the build had to be in a cache first to generate the pod. It's true that until all controllers use the shared informers, these are going to be different caches.
so the concern here would be if someone deletes a pod for a running build and we go to look up the build for that pod and can't find it, we will not properly mark the build as "failed/error: pod was deleted"
as you say, probably unlikely in this particular flow since we have to have seen the build at some point to have created the pod, so maybe we can live with it, but for all these cases where we are looking up other resources in caches we need to think about the implications of a stale cache/race. I really wish we had a read-through cache option that would hit the api server if the item isn't found in the cache.
> so the concern here would be if someone deletes a pod for a running build and we go to look up the build for that pod and can't find it, we will not properly mark the build as "failed/error: pod was deleted"
Yes we will, because we will handle the pod deleted event.
> Yes we will, because we will handle the pod deleted event.
?
handling the pod deleted event properly requires that we are able to find the build associated w/ the pod that we got a delete event for. If we can't find the build because it's not in the cache yet (and we don't check the api server directly), we can't handle the event properly, because handling the event properly requires updating the build object.
again, i think you've convinced me it's not possible for the build not to be in the cache in this particular case, but the general class of problem is real.
We collapsed the deployer pod controller into the deployments controller and moved the pod lookup to be a cache lookup w/o any major problem. The change was 1.3 or 1.4, can't remember.
The delete controller has been replaced by the event listener on the informer. I will remove the podDeleteLW struct; I forgot to take it out.
how does that handle reconciliation? (ie what happens if we miss the delete event? what resends it?)
The shared informer is using the same deltafifo we were using under the covers.
that doesn't handle reconciliation (it really can't). this is how we handle reconciliation: we go through all the builds, find the pods that are supposed to be associated with them, check if the pod actually exists, and if it doesn't exist, we manufacture a pod deletion event so that the HandlePodDeletion logic will get called.
Of course, you want to make me dig through more code :) Let me take a look and see if we still need that logic. I know that nothing like that is done in the deployment controllers, but they do a lot of things differently than builds.
i'm willing to bet that either
fundamentally we need to be able to deal w/ the possibility that our controller was down when the pod deletion event fired. when we come up, nothing is going to send us that event (the object is gone, there's no history of it, etc), so we need to somehow determine "hey, there's a pod missing, it must have been deleted, we need to do something about that"
go wait.Until(c.worker, time.Second, stopCh)
<-stopCh
glog.Infof("Shutting down build pod controller")
c.queue.ShutDown()
defer this
@Kargakis it doesn't get deferred in any of the deployment controllers...
https://github.com/openshift/origin/blob/master/pkg/deploy/controller/deployment/factory.go#L83
https://github.com/openshift/origin/blob/master/pkg/deploy/controller/deploymentconfig/factory.go#L91
https://github.com/openshift/origin/blob/master/pkg/deploy/controller/generictrigger/factory.go#L76
what would be the reason for doing it here?
For historical reasons, we should update those controllers to conform to the latest trends. We defer everywhere upstream.
thx will update
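The suggested shape is roughly (a sketch, not the final diff):

```go
func (c *BuildPodController) Run(stopCh <-chan struct{}) {
	defer utilruntime.HandleCrash()
	// deferring ShutDown guarantees the queue is closed (and workers exit)
	// even if Run returns early or panics
	defer c.queue.ShutDown()

	go wait.Until(c.worker, time.Second, stopCh)
	<-stopCh
	glog.Infof("Shutting down build pod controller")
}
```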
// Wait for the build store to sync before starting any work in this controller.
ready := make(chan struct{})
go c.waitForSyncedStores(ready, stopCh)
There is some machinery coming up that handles some of this stuff - you may want to look at how this is implemented in upstream controllers, eg. https://github.com/kubernetes/kubernetes/blob/3705358c59dc42e7cdbc2c942845a70547076d2b/pkg/controller/deployment/deployment_controller.go#L149
Thx, will switch it to do the same
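The upstream pattern referenced above looks roughly like this (a sketch; buildStoreSynced/podStoreSynced stand in for the informers' HasSynced functions, and the helper's exact package has moved between releases):

```go
// block until the informer caches have synced before starting workers,
// instead of polling a hand-rolled ready channel
if !cache.WaitForCacheSync(stopCh, c.buildStoreSynced, c.podStoreSynced) {
	utilruntime.HandleError(fmt.Errorf("timed out waiting for caches to sync"))
	return
}
```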
// We must avoid processing build pods until the build and pod stores have synced.
// If they haven't synced, to avoid a hot loop, we'll wait this long between checks.
storeSyncedPollPeriod = 100 * time.Millisecond
maxRetries = 60
Any particaular reason behind this number? I would say if you need to retry something, 10 at most would be enough.
This is what we have in our original controller code... @bparees ?
we originally coded it to retry for something like a minute (the rate limiter we used effectively limited you to one retry per second after an initial burst). This primarily is about retrying things where we are waiting on some resource to appear (eg an imagestream or something), not where the call is failing, so we want to give it enough time to potentially appear.
If you are depending on some other resource, you should handle events for it. For example, if you wait for a pod to appear so you can update a build, handling pod events should be enough.
I disagree that listening on all imagestreams in the cluster to see if the imagestream we need has appeared yet and then go back and figure out what build was waiting for that imagestream and prod it forward (and if we don't get the event, at some point go back and fail the build) is a better solution than retrying on the build event a few times.
I know we've had this debate elsewhere but i stand by my view of that being unnecessary overhead+complexity, especially when 99% of the time the imagestream(or secret or whatever) is going to exist.
Also if the object already exists with deletionTimestamp != nil but is not yet removed from etcd, you need to have a check in your ADD handlers that checks those and manually calls DELETE.
> Once your cache is synced, everything that is missing from it, is deleted.
huh? how do i know something is missing from it? the item won't be in etcd and it won't be in the cache, because it was deleted while we were down, and when we came up, we synced our cache from etcd (which, again, does not contain the item)
> Also if the object already exists with deletionTimestamp != nil but is not yet removed from etcd, you need to have a check in your ADD handlers that checks those and manually calls DELETE.
why? won't we still get a real delete event when the item is actually deleted from etcd? and of course if we had reconcile logic, we'd catch it then too.
> how do i know something is missing from it?
You have a build A that depends on a pod A-pod. The build controller goes down and A-pod is deleted for some reason. The build controller comes up, reconciles all builds, and sees that A-pod for A does not exist. This also depends on the state of the Build; for example, you know it was deleted because the Build is Running. You probably want to recreate the Pod if the Build is New or Pending.
> why? won't we still get a real delete event when the item is actually deleted from etcd? and of course if we had reconcile logic, we'd catch it then too.
I think you simply don't want to handle that event as ADD; I'm not sure about the timing of when the DELETE is propagated from etcd (marked as deleted vs really deleted). You also need the logic to exist in the reconcile loop, eg. for avoiding actions like adoption.
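The ADD-handler check being described would look something like this (a sketch with hypothetical handler names, not code from this PR):

```go
func (bc *BuildPodController) addPod(obj interface{}) {
	pod := obj.(*kapi.Pod)
	if pod.DeletionTimestamp != nil {
		// the pod showed up in the cache already marked for deletion
		// (e.g. on a resync after a controller restart); route it through
		// the delete path instead of the add path
		bc.deletePod(pod)
		return
	}
	bc.enqueuePod(pod)
}
```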
btw today we don't even handle DELETEs for ImageStreams:
origin/pkg/deploy/controller/generictrigger/factory.go, lines 46 to 49 (c35d153):
streamInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc:    c.addImageStream,
	UpdateFunc: c.updateImageStream,
})
Yup, it looks like I need to add a handler for build events and enqueue the deleted event when I get one and there's no corresponding pod.
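Something along these lines, for example (a sketch; checkBuildPodDeletion matches the handler added later in this PR, but buildPodNameFor is a hypothetical helper):

```go
// on a build event, verify the pod that should back an in-flight build still
// exists; if it does not, enqueue its key so the worker runs the same
// "pod was deleted" handling it would run for a real delete event
func (bc *BuildPodController) checkBuildPodDeletion(build *buildapi.Build) {
	if build.Status.Phase != buildapi.BuildPhasePending && build.Status.Phase != buildapi.BuildPhaseRunning {
		return // no pod is expected yet, or the build is already terminal
	}
	key := build.Namespace + "/" + buildPodNameFor(build) // hypothetical helper
	if _, exists, err := bc.podStore.Indexer.GetByKey(key); err == nil && !exists {
		bc.queue.AddRateLimited(key)
	}
}
```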
}
if strategy.IsFatal(err) {
	glog.V(3).Infof("Will not retry fatal error for pod %s/%s: %v", pod.Namespace, pod.Name, err)
Maybe V(2)
	return
}
glog.V(3).Infof("Giving up retrying pod %s/%s: %v", pod.Namespace, pod.Name, err)
Maybe V(2)
func (bc *BuildPodController) getBuildForPod(pod *kapi.Pod) (*buildapi.Build, bool, error) {
	if !isBuildPod(pod) {
		return nil, false, errors.New("cannot get build for pod: pod is not a build pod")
This check seems redundant since you only requeue build pods from the resource handlers. Can you at least print the name of the pod in the error?
Note that once the 1.6 kube rebase lands, you'll be able to use generated listers and shared informers for our api groups (the new groups, not the legacy ones).
force-pushed from 053a7dd to 432c2c0
	glog.V(2).Infof("Failing build %s/%s because the pod has no containers", build.Namespace, build.Name)
	nextStatus = buildapi.BuildPhaseFailed
} else {
	for _, info := range pod.Status.ContainerStatuses {
If the pod is marked as succeeded then all its containers succeeded, right?
This code was specifically added for a strange case where the pod was marked as succeeded but it had no containers.
buildListerUpdater := buildclient.NewOSClientBuildClient(oc)
c := &BuildPodController{
	buildUpdater: buildListerUpdater,
	secretClient: kc.Core(),
Maybe a TODO to switch this to a cache lookup
Thx, TODO added
	return
}
go wait.Until(c.worker, time.Second, stopCh)
Running a single worker for all build pods in the cluster?
so far we have only run one... @bparees would you be ok with increasing to 5?
any risk of race conditions when we start processing events in parallel? especially around the serial build policy?
(could have one thread processing a "pod is now running" event when another thread gets a "pod is complete" event and the second thread could finish its work first, for example?)
I can't think of any, but presumably we have the same risks today in a ha setup, no? And we test for that in integration when running multiple controllers in parallel.
The cache is thread-safe and when you process a key, nobody else can process it at the same time. Or at least that is my impression (not deeply familiar with the internals). cc: @deads2k
I am definitely not sure what happens in a HA setup but that is not related to having a single or multiple goroutines here.
The workqueue ensures that only one key with the value "X" is processed at a time.
@bparees we're done when we say we're done :)
https://github.com/openshift/origin/pull/13510/files#diff-94fa474175901a5802dadd44c4f318caR244
@csrwng cool.
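Given that per-key guarantee, running more workers is mostly a matter of (sketch):

```go
// start several workers; the workqueue still hands any given key to only one
// worker at a time, so processing for a single pod stays serialized
for i := 0; i < 5; i++ {
	go wait.Until(c.worker, time.Second, stopCh)
}
<-stopCh
```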
pod, err := bc.getPodByKey(key.(string))
if err != nil {
	glog.Error(err.Error())
You should replace all instances of glog.Errorf with utilruntime.HandleError (it's rate-limited and generally accepted over glog.Errorf)
thx, will do
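i.e. something like (a sketch of the substitution):

```go
pod, err := bc.getPodByKey(key.(string))
if err != nil {
	// utilruntime.HandleError is rate-limited and pluggable, unlike raw glog.Errorf
	utilruntime.HandleError(fmt.Errorf("unable to retrieve pod %q from store: %v", key, err))
	return
}
```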
force-pushed from a921cb2 to f379fb0
addressed latest comments
build.Status.Message = ""
nextStatus = buildapi.BuildPhaseRunning
case kapi.PodPending:
nit: move Pending before Running
Order doesn't matter unless you're only talking about visual appearance
Sure, I just felt it's annoying not to handle these cases in their logical order. Anyway a nit
obj, exists, err := bc.podStore.Indexer.GetByKey(key)
if err != nil {
	glog.Infof("Unable to retrieve pod %q from store: %v", key, err)
	bc.queue.Add(key)
Needs to be ratelimited
func (bc *BuildPodController) addBuild(obj interface{}) {
	build := obj.(*buildapi.Build)
	bc.checkBuildPodDeletion(build)
When a build is created, its pod doesn't exist but we'll end up triggering a deletion event for the pod, right?
force-pushed from f379fb0 to 7e4344f
Huh? I suspect we're talking about two different caches now. I was talking about the server-side GET cache, which is known to give stale information during tests upstream. I think you may be talking about a generated lister, which is different.
i'm really talking about both the shared informer caches and this get-cache you're saying upstream is moving towards. As a client component, we need a way to say "does this object exist?" and be able to trust the answer when the system says "no, i have no record of that object". What you and @Kargakis seem to be saying is "you can never trust that answer, you will never be able to trust that answer, and the best you can do is wait an infinite amount of time, and if after the universe ends the object still doesn't appear to exist in any caches, then maybe you can assume it has actually been deleted as opposed to just not having made it into a cache yet". With that behavior, it is impossible for us to safely implement the functionality of "user deleted the build pod, we need to fail the build now", because if we miss the initial pod-delete event we have no way to reconcile the state: we can never know for sure whether the pod has been deleted or simply hasn't made it into whatever cache we're forced to query (or are indirectly querying because the api server is serving cached values w/o doing a read-through on miss), meaning we will always be at risk of failing a build that we shouldn't have failed.
I haven't read the code in detail. How would you miss a delete event? If all you really want is a live read of a pod on a "not found", feel free to do it. It doesn't seem necessary to me, but it's not like it would be a common case.
controller gets restarted after being down/wedged when the delete occurred would be one way.
This is inaccurate. Most of the time, the lag between etcd and the controller cache is minimal (seconds at most) and a couple of retries will suffice. You can always do a live GET once your retries are up, but it would be rarely used, if at all, depending on the number of retries.
@deads2k indicated even the live GET may not be an actual get against etcd at some point in the future, due to caches being introduced upstream (and/or moving to a replicated etcd model w/o quorum reads enabled). In @csrwng's testing, most of the time this doesn't happen at all (the cache is populated fast enough), so this isn't really about the "mainline" scenario, it's about handling the long-tail cases. So yes, we can retry the cache a few times before going to the raw get (ignoring the aforementioned concerns that even a raw get won't give us the truth), but from what we've seen so far, it's not even worth retrying: we can just do the raw get on the first cache miss, since it will be relatively infrequent (and the current code is always doing a raw get today, and it hasn't shown up as a performance issue).
force-pushed from 11389d2 to dec1b07
I added a commit to queue the 'pod not found' events and retry them using the cache. If it is still not found in the cache after 3 retries, it tries the API client directly. If it is still not found, it marks the build as failed.
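Roughly, the lookup now behaves like this (a sketch of the approach rather than the committed code; podNotFoundRetries, bc.kubeClient, and errPodDeleted are placeholder names):

```go
// try the informer cache first; requeue a few times to ride out cache lag,
// then fall back to a live GET before declaring the pod deleted
obj, exists, err := bc.podStore.Indexer.GetByKey(key)
if err == nil && exists {
	return obj.(*kapi.Pod), nil
}
if bc.queue.NumRequeues(key) < podNotFoundRetries { // e.g. 3
	bc.queue.AddRateLimited(key)
	return nil, nil
}
// cache retries exhausted: ask the API server directly
ns, name, _ := cache.SplitMetaNamespaceKey(key)
pod, err := bc.kubeClient.Core().Pods(ns).Get(name, metav1.GetOptions{})
if kerrors.IsNotFound(err) {
	// the pod is really gone; the caller marks the build failed
	return nil, errPodDeleted
}
return pod, err
```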
plus the open question about whether we need to handle pod ADD events.
pkg/build/controller/common/util.go
	return
}
if err := runPolicy.OnComplete(build); err != nil {
	glog.Errorf("failed to run policy on completed build: %v", err)
should be calling HandleError on both of these errors.
// no containers in the pod means something went badly wrong, so the build
// should be failed.
glog.V(2).Infof("Failing build %s/%s because the pod has no containers", build.Namespace, build.Name)
nextStatus = buildapi.BuildPhaseFailed
we should be setting a build status reason/message here.
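e.g., alongside the phase change, something like (sketch; the reason constant shown is hypothetical and would come from buildapi):

```go
nextStatus = buildapi.BuildPhaseFailed
// record why the build failed, not just that it failed
build.Status.Reason = buildapi.StatusReasonNoBuildContainerStatus // hypothetical constant
build.Status.Message = "The build pod reported no container statuses"
```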
// The only error that can currently be returned is a NotFound error, but adding a check
// here just in case that changes in the future
if err != nil && errors.IsNotFound(err) {
please add some comments explaining why we're doing this (cache latency)
glog.V(2).Infof("Will not retry fatal error for pod %s/%s: %v", pod.Namespace, pod.Name, err)
bc.queue.Forget(key)
return
}
is there any way to throw a fatal error in this code? I'm not seeing one.
what conditions would we want to consider as fatal for handling a pod event?
At this point there are no fatal errors, but we will need this logic once we merge this with the build controller which does have a fatal error.
force-pushed from dec1b07 to ea67bc2
Changes:
- Creates a new controller under a buildpod package that follows conventions in deployment controllers.
- Refactors common controller functions into a new controller/common package for use by existing and new controllers.
- Adds informer and lister for builds.
- Removes previous controller code.
force-pushed from ea67bc2 to 18036b7
latest comments addressed
[testextended][extended:core(builds)]
Evaluated for origin testextended up to 18036b7
continuous-integration/openshift-jenkins/testextended SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended/116/) (Base Commit: 3b82ca6) (Extended Tests: core(builds))
Flake #11114
Flake #12236
Evaluated for origin test up to 18036b7
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/551/) (Base Commit: f0670df)
lgtm
[merge]
Evaluated for origin merge up to 18036b7
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/244/) (Base Commit: 6a175fa) (Image: devenv-rhel7_6115)