Differentiate liveness and readiness probes for router pods #19009
@smarterclayton is this a reasonable way to approach the problem?
@openshift/networking PTAL
Looks good to me.
But this just separates the URLs. The logic served on each also needs to be differentiated. So is there going to be a part 2? Or am I missing something?
pkg/oc/admin/router/router.go
Outdated
@@ -436,7 +436,7 @@ func generateSecretsConfig(cfg *RouterConfig, namespace string, defaultCert []by
 	return secrets, volumes, mounts, nil
 }

-func generateProbeConfigForRouter(cfg *RouterConfig, ports []kapi.ContainerPort) *kapi.Probe {
+func generateProbeConfigForRouter(Path string, cfg *RouterConfig, ports []kapi.ContainerPort) *kapi.Probe {
camelCase for 'Path'
As Rajat pointed out, need to camelCase the variable. Otherwise, LGTM
@rajatchopra I don't think there will be part 2 here.
fefb00b
to
1f93335
Compare
pkg/oc/admin/router/router.go
Outdated
@@ -447,7 +447,7 @@ func generateProbeConfigForRouter(cfg *RouterConfig, ports []kapi.ContainerPort)
 	}

 	probe.Handler.HTTPGet = &kapi.HTTPGetAction{
-		Path: "/healthz",
+		Path: path,
 		Port: intstr.IntOrString{
 			Type: intstr.Int,
 			IntVal: int32(healthzPort),
Even though a path is being passed (which is either livez or healthz), the port is always healthz?
Yes, both /livez and /healthz paths are using StatsPort
https://github.com/JacobTanenbaum/origin/blob/1f93335b3f25944c202f7f05e3e52201d07ff972/pkg/cmd/infra/router/template.go#L206 and
https://github.com/JacobTanenbaum/origin/blob/1f93335b3f25944c202f7f05e3e52201d07ff972/pkg/oc/admin/router/router.go#L446
 	if cfg.StatsPort > 0 {
-		healthzPort = cfg.StatsPort
+		probePort = cfg.StatsPort
@rajatchopra @pravisankar does this make it clearer? I removed the reference to healthzPort in favour of a more generic name.
+1
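For readers following the thread, here is a minimal sketch of the shape being discussed at this revision: one probe builder that takes the path as a parameter, with a single port shared by both paths. It is not the PR's code. httpGetProbe and probeFor are hypothetical stand-ins for kapi.Probe and generateProbeConfigForRouter, and the port numbers are illustrative only.

```go
package main

import "fmt"

// httpGetProbe is a hypothetical stand-in for the HTTPGet part of kapi.Probe.
type httpGetProbe struct {
	Path string
	Port int32
}

// probeFor builds a probe for the given path. The stats port, when set,
// overrides the default port, mirroring the `if cfg.StatsPort > 0` branch
// shown in the diff above. The path never influences the port.
func probeFor(path string, defaultPort, statsPort int) httpGetProbe {
	probePort := defaultPort
	if statsPort > 0 {
		probePort = statsPort
	}
	return httpGetProbe{Path: path, Port: int32(probePort)}
}

func main() {
	// Port values are assumptions chosen for the example.
	fmt.Println(probeFor("/livez", 1935, 1936))
	fmt.Println(probeFor("/healthz", 1935, 0))
}
```

Naming the variable probePort rather than healthzPort matches the point made above: the port is independent of which path the probe targets.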
34283c4
to
3b65bf9
Compare
/lgtm
pkg/oc/admin/router/router.go
Outdated
@@ -466,15 +465,15 @@ func generateProbeConfigForRouter(cfg *RouterConfig, ports []kapi.ContainerPort)
 }

 func generateLivenessProbeConfig(cfg *RouterConfig, ports []kapi.ContainerPort) *kapi.Probe {
-	probe := generateProbeConfigForRouter(cfg, ports)
+	probe := generateProbeConfigForRouter("/livez", cfg, ports)
This should be /healthz/ready, which is our standard readiness check path.
/healthz is liveness, /healthz/ready is readiness.
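A minimal sketch of that convention, using only the standard library: /healthz answers as soon as the process is serving, while /healthz/ready stays unhealthy until some readiness condition holds. newHealthMux, statusOf and the isReady callback are hypothetical names for this example and are not taken from the router code.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newHealthMux serves liveness on /healthz and readiness on /healthz/ready.
// isReady is a hypothetical callback, e.g. "has the first full sync finished".
func newHealthMux(isReady func() bool) *http.ServeMux {
	mux := http.NewServeMux()
	// Liveness: succeeds whenever the process can answer HTTP at all.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Readiness: fails until the caller-supplied condition is true.
	mux.HandleFunc("/healthz/ready", func(w http.ResponseWriter, r *http.Request) {
		if !isReady() {
			http.Error(w, "not ready", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// statusOf issues an in-memory GET against h and returns the status code.
func statusOf(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	ready := false
	mux := newHealthMux(func() bool { return ready })
	fmt.Println(statusOf(mux, "/healthz"), statusOf(mux, "/healthz/ready"))
	ready = true
	fmt.Println(statusOf(mux, "/healthz"), statusOf(mux, "/healthz/ready"))
}
```

With this split, a pod whose liveness probe targets /healthz is not restarted merely because it has not become ready yet.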
pkg/router/metrics/metrics.go
Outdated
@@ -37,6 +37,10 @@ type Listener struct {
 func (l Listener) handler() http.Handler {
 	mux := http.NewServeMux()
 	healthz.InstallHandler(mux, l.Checks...)
+	mux.Handle("/livez", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
What are you actually trying to check for?
This should be following our standard conventions, but you cannot change the meaning of the existing health endpoint without breaking backwards compatibility. Can you describe what the goal is here - have the endpoint available for load balancers prior to the router actually being loaded? Generally if you want that you just use the service field
@smarterclayton the router added ROUTER_BIND_PORTS_AFTER_SYNC so that external load balancers could have useful health checks on port 80 or 443 to know when the router was online. Before we added ROUTER_BIND_PORTS_AFTER_SYNC, the router would bind those ports immediately and potentially serve HTTP 503 statuses for valid routes because the routes had not been loaded yet. So we added ROUTER_BIND_PORTS_AFTER_SYNC so that it would not bind 80/443 until a full sync had happened, BUT would bind 1936 so that the liveness probes worked (otherwise the router gets killed).
After the refactor to make the router controller return status directly, if you set ROUTER_BIND_PORTS_AFTER_SYNC then you start failing liveness probes. @JacobTanenbaum found out that this was due to the liveness check done by the router controller contacting the haproxies it controlled to see if they were up, and if not, returning false. So, when ROUTER_BIND_PORTS_AFTER_SYNC is set, the haproxy doesn't bind 80 and 443, so the delegated liveness check fails and the pod gets terminated.
So, the goal here is to have an endpoint indicating that the router controller is live but not ready yet... but we have to work with what we have at the moment:
- /healthz -- Returns only when the backends are ready (changed in 3.7)
- /healthz/backend-http -- Returns only when haproxy is up (added in 3.7)
Before 3.7, /healthz was a liveness check.
Can we call this a bugfix, add /healthz/ready, and move the multiplexer that is on /healthz there? (Should we also support /healthz/ready/backend-http, since that's what the multiplexer will set up?) Then /healthz can go back to returning true as soon as the router controller becomes active.
The liveness check is to prevent a crashed haproxy router from staying dead
due to a route controller bug. The primary purpose is to keep the pod
running. The fundamental behavior of that check didn't change (pre 3.7, if
that failed haproxy was dead, and post 3.7, if that failed haproxy is
dead), but the secondary effect you're referencing did.
No external load balancer should be using /healthz to determine whether to
put the router in rotation - it should be using /healthz/ready or something
equivalent. If we want to health check the router controller we should add
`/healthz/controller` or similar.
When we move to service load balancer for router on cloud providers, the
readiness check for the router service should be /healthz/backend-http (if
you only want to be in rotation if haproxy is listening), and set the
service to preserveUnreadyEndpoints (if you want to be in rotation
regardless of readiness)
…On Tue, Mar 20, 2018 at 10:21 AM, Ben Bennett ***@***.***> wrote:
- /healthz -- Returns only when the backends are ready (changed in 3.7)
- /healthz/backend-http -- Returns only when haproxy is up (added in 3.7)
@smarterclayton The problem is that the external load balancers are looking at 80 and 443 directly to decide whether to put it in rotation. BUT the liveness check defined for the pod is on /healthz and that is checking that the haproxy is live too before returning 200. So when we don't bind to 80 the liveness probes fail. What would you suggest we do to make the liveness check work? I see two options for this
3b65bf9
to
2f18e8c
Compare
Sorry for not responding faster, this is on my list to respond to.
…On Thu, Mar 22, 2018 at 1:54 PM, OpenShift CI Robot < ***@***.***> wrote:
New changes are detected. LGTM label has been removed.
@@ -66,16 +66,25 @@ func NamedCheck(name string, check func(r *http.Request) error) HealthzChecker {
 // exactly one call to InstallHandler. Calling InstallHandler more
 // than once for the same mux will result in a panic.
 func InstallHandler(mux mux, checks ...HealthzChecker) {
+	InstallPathHandler(mux, "/healthz", checks...)
Do we need to change vendor code? Can't we do this in OpenShift code?
AFAIK changing vendor code needs 'UPSTREAM: ' commit in this case.
@pravisankar I did post an upstream PR to accompany this, kubernetes/kubernetes#63716. The commit that includes the Kubernetes code is tagged UPSTREAM:63716; should I add this tag to the PR title?
No, as far as we can tell there is no way to do this in OpenShift alone. We use InstallHandler for the checks, and currently you can only have one set of checks, all of which have to pass for both liveness and readiness. The changes in vendor allow us to create two sets of checks.
Separate UPSTREAM:63716 commit is good, no need to add upstream tag to the PR title.
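To illustrate why a path-aware installer makes two sets of checks possible, here is a standard-library-only sketch. It is not the vendored apiserver code: namedCheck, installPathHandler and newRouterMux are hypothetical names, the failure status code is an assumption, and the assignment of checks to paths (no checks for liveness, has-synced for readiness) is chosen only for the example.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// namedCheck is a hypothetical stand-in for a healthz checker: a name plus
// a function that returns nil when the check passes.
type namedCheck struct {
	name  string
	check func(r *http.Request) error
}

// installPathHandler registers one group of checks under path. Every check
// in the group must pass for that path to return 200.
func installPathHandler(mux *http.ServeMux, path string, checks ...namedCheck) {
	mux.HandleFunc(path, func(w http.ResponseWriter, r *http.Request) {
		for _, c := range checks {
			if err := c.check(r); err != nil {
				msg := fmt.Sprintf("%s failed: %v", c.name, err)
				http.Error(w, msg, http.StatusInternalServerError)
				return
			}
		}
		fmt.Fprint(w, "ok")
	})
}

// newRouterMux wires two independent check groups onto two paths.
func newRouterMux(synced func() bool) *http.ServeMux {
	hasSynced := namedCheck{
		name: "has-synced",
		check: func(*http.Request) error {
			if !synced() {
				return errors.New("router not synced")
			}
			return nil
		},
	}
	mux := http.NewServeMux()
	installPathHandler(mux, "/healthz")                  // liveness group: empty
	installPathHandler(mux, "/healthz/ready", hasSynced) // readiness group
	return mux
}

// statusOf issues an in-memory GET against h and returns the status code.
func statusOf(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	synced := false
	mux := newRouterMux(func() bool { return synced })
	fmt.Println(statusOf(mux, "/healthz"), statusOf(mux, "/healthz/ready"))
	synced = true
	fmt.Println(statusOf(mux, "/healthz"), statusOf(mux, "/healthz/ready"))
}
```

With a single fixed path, every check is tied to that one endpoint; taking the path as a parameter lets each endpoint carry its own group.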
@JacobTanenbaum Can you fix your description to capture the current behavior, please?
/lgtm
/approve
/assign @deads2k
/retest
LGTM
/retest
@deads2k can you approve the upstream commit to vendor/k8s.io/kubernetes/staging/src/k8s.io/apiserver, please? (kubernetes/kubernetes#63716 has landed upstream.) Thanks
/retest
@deads2k -- can you approve this, since the upstream PR has merged?
…e path to be associated with health checking. Currently it is only possible to have one group of checks which must all pass for the handler to report success. Allowing multiple paths for these checks allows use of the same machinery for other kinds of checks, i.e. readiness. This upstream change allows for the differentiation of health and readiness checks
Add a backend to the router controller "/livez" that always returns true. This differentiates the liveness and readiness probes so that a router can be alive and not ready. Bug 1550007
/approve
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: deads2k, JacobTanenbaum, knobunc, pravisankar, ramr The full list of commands accepted by this bot can be found here. The pull request process is described here
@@ -35,6 +36,28 @@ func HTTPBackendAvailable(u *url.URL) healthz.HealthzChecker {
 	})
 }

+// HasSynced returns a healthz check that verifies the router has been synced at least
+// once.
+func HasSynced(router **templateplugin.TemplatePlugin) healthz.HealthzChecker {
This sort of construct is a violation of our style guides. You should be passing an interface here that exposes the correct check method. Double pointers should never be used. Nothing in metrics should be aware of template plugin at all.
func HasSynced(router **templateplugin.TemplatePlugin) healthz.HealthzChecker {
	return healthz.NamedCheck("has-synced", func(r *http.Request) error {
		if router != nil {
			if (*router).Router.SyncedAtLeastOnce() == true {
This construct is not appropriate. It should always be if booleancondition {
		if router != nil {
			if (*router).Router.SyncedAtLeastOnce() == true {
				return nil
			} else {
When you return early, elide the else, as per our style guide.
			if (*router).Router.SyncedAtLeastOnce() == true {
				return nil
			} else {
				return fmt.Errorf("Router not synced")
Errors should always be lower case, as per the style guide.
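Taken together, the four review comments on HasSynced point toward something like the sketch below: accept a narrow interface instead of a double pointer to the template plugin, test the boolean directly, return early without an else, and use a lower-case error string. This is one possible reading of the feedback, not the code that was merged. syncChecker, hasSynced and fakeRouter are hypothetical names, and the behavior for a nil router is an assumption, since the original snippet is cut off before that branch.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// syncChecker is a hypothetical narrow interface. The metrics package would
// depend on this rather than on the template plugin type.
type syncChecker interface {
	SyncedAtLeastOnce() bool
}

// hasSynced returns a check function that fails until the router has
// completed at least one sync.
func hasSynced(router syncChecker) func(r *http.Request) error {
	return func(r *http.Request) error {
		// Assumption: a missing router is reported as a failed check.
		if router == nil {
			return errors.New("router not initialized")
		}
		if router.SyncedAtLeastOnce() {
			return nil
		}
		return errors.New("router not synced")
	}
}

// fakeRouter is a test double that satisfies syncChecker.
type fakeRouter struct{ synced bool }

func (f *fakeRouter) SyncedAtLeastOnce() bool { return f.synced }

func main() {
	r := &fakeRouter{}
	check := hasSynced(r)
	fmt.Println(check(nil)) // the request is unused by this check
	r.synced = true
	fmt.Println(check(nil))
}
```

The returned function has the same shape as the callback passed to healthz.NamedCheck in the snippets above, so it could be wrapped the same way.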
Create an upstream commit that allows multiple groups of checks to be associated with health checking. Use the multiple groupings to differentiate the liveness and readiness probes for the HAProxy router.
Bug 1550007