A new liveness probe for router pod #12846

Closed

louyihua
Contributor

@louyihua louyihua commented Feb 7, 2017

To ultimately prevent bug 1405440, this PR introduces a new, implementation-independent HTTP GET health check. The new health check is served by the openshift-router process itself. For the HAProxy-based router, the health check uses HAProxy's CLI on the stats socket for the liveness probe, and /healthz for the readiness probe.

@louyihua louyihua changed the title A new liveness probe for router pod [DO NOT MERGE] A new liveness probe for router pod Feb 7, 2017
@louyihua
Contributor Author

louyihua commented Feb 8, 2017

@jeremyeder I think we can continue discussing the liveness probe of the router pod here.
As you said in #12716 , you can set lower values of maxconn in the frontend sections while keeping a high value of maxconn in the global section, so that the /healthz endpoint always has connection resources available. This is correct in theory, but I think it is not easy to achieve with the current configuration.

As you can see, HAProxy currently handles FIVE frontends:

  • two public frontends (HTTP & HTTPS)
  • two intermediate HTTPS frontends
  • the stats frontend

If we use Mg to denote the global maxconn, Mt to denote the stats maxconn, and Mh and Ms to denote the maxconn of the public HTTP & HTTPS frontends respectively, we can easily see that the inequality Mt + Mh + Ms * 2 < Mg must be satisfied. Here, Ms * 2 is an upper bound, corresponding to a router with no passthrough routes configured, where each HTTPS connection is counted twice because it also passes through an intermediate frontend. Although the inequality is simple, the really hard part is finding the correct values of Mh and Ms, as the volume of incoming traffic varies over time. A misconfiguration of Mh and Ms causes either wasted resources or service interruption.
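
To make the arithmetic concrete, here is a minimal sketch of the budgeting check described above (the numbers are made up purely for illustration and are not taken from any real router configuration):

```go
package main

import "fmt"

// healthzBudgetOK reports whether the stats frontend still has headroom
// under the global limit, given the per-frontend maxconn values.
func healthzBudgetOK(mg, mt, mh, ms int) bool {
	// Worst case (no passthrough routes): every public HTTPS connection
	// also traverses an intermediate frontend, so it is counted twice.
	return mt+mh+ms*2 < mg
}

func main() {
	// Illustrative numbers only.
	fmt.Println(healthzBudgetOK(20000, 100, 7000, 6000)) // true:  19100 < 20000
	fmt.Println(healthzBudgetOK(20000, 100, 9000, 6000)) // false: 21100 >= 20000
}
```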

Furthermore, I think a liveness probe's responsibility is to check whether a process is alive or dead, and it is the readiness probe's responsibility to check whether a living process works correctly or not -- surely the /healthz endpoint belongs to the latter. And in the current configuration, the /healthz endpoint always returns 200, as there is no failure condition configured, which means that even if the other frontends reach their connection limits, the pod's status cannot reflect that condition.

So, rather than relying on these hard-to-choose values to keep the stats endpoint always available, it may be much better to find another mechanism, unrelated to this limit, to serve as the pod's liveness probe, while keeping the /healthz endpoint as the readiness probe. This approach also has a nice side effect: by configuring only the global maxconn and the default maxconn, the pod's readiness state reflects whether the router has reached its connection limit or not!

@smarterclayton
Contributor

This probe seems way more complex. /healthz is almost always liveness, not readiness (we specifically expose other readiness probes for that).

@smarterclayton
Contributor

Exec is also much heavier than an HTTP endpoint. Also, other platforms that call into the router can't easily use that check and might rely on healthz. I would much prefer a solution that makes healthz obey different connection limits.

@louyihua
Contributor Author

louyihua commented Feb 10, 2017

@smarterclayton
As the OpenShift platform cannot distinguish whether a connection timeout on /healthz is caused by HAProxy not having started or by it being overloaded, using this endpoint for both liveness and readiness is not really suitable.
Maybe it should be the openshift-router, or another agent inside the pod, that provides the health check endpoints rather than haproxy. In that case, the agent could check through other means whether the haproxy process is in a normal state or has not started, and then combine that information with the timeout result from haproxy's monitor-uri to figure out what caused the timeout (overload? not started? some other error?).
In this way, the router can still provide its health checks through HTTP endpoints without being affected by the connection limits.
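
To illustrate the idea (this is only a rough sketch, not code from this PR; the pid-file path, monitor URL, and port below are invented for the example), such an agent could combine the two signals roughly like this:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"strconv"
	"strings"
	"syscall"
	"time"
)

// Hypothetical paths/URLs, for illustration only.
const (
	haproxyPidFile = "/var/lib/haproxy/run/haproxy.pid"
	monitorURL     = "http://localhost:1936/healthz"
)

// haproxyRunning checks that the haproxy process exists (signal 0).
func haproxyRunning() bool {
	data, err := os.ReadFile(haproxyPidFile)
	if err != nil {
		return false
	}
	pid, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		return false
	}
	return syscall.Kill(pid, 0) == nil
}

// monitorReachable hits the monitor-uri with a short timeout.
func monitorReachable() bool {
	client := &http.Client{Timeout: 1 * time.Second}
	resp, err := client.Get(monitorURL)
	if err != nil {
		return false
	}
	resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// liveness combines both signals: if the process is up but /healthz times
// out, the router is most likely overloaded rather than dead, so we still
// report it as alive.
func liveness(w http.ResponseWriter, r *http.Request) {
	switch {
	case !haproxyRunning():
		http.Error(w, "haproxy process not running", http.StatusServiceUnavailable)
	case !monitorReachable():
		fmt.Fprintln(w, "haproxy running but busy (probably at its connection limit)")
	default:
		fmt.Fprintln(w, "ok")
	}
}

func main() {
	http.HandleFunc("/alive", liveness)
	http.ListenAndServe(":1935", nil)
}
```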

@louyihua louyihua changed the title [DO NOT MERGE] A new liveness probe for router pod A new liveness probe for router pod Feb 16, 2017
@louyihua louyihua force-pushed the router-probe-fix branch 3 times, most recently from 734ccfb to be23bb9 Compare February 16, 2017 01:46
@louyihua
Contributor Author

louyihua commented Feb 16, 2017

@smarterclayton It may be difficult to make /healthz obey different connection limit rules, as it resides in a normal HAProxy frontend rather than a special one.
I've proposed a health check that is provided by the openshift-router process. This gives more flexibility, as it provides a consistent health check endpoint for the platform and other applications, no matter which underlying implementation is used. For the current HAProxy-based router, the liveness probe uses HAProxy's stats socket, while the readiness probe uses the existing /healthz endpoint.
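
For reference, the liveness side of this reduces to talking to HAProxy's stats socket. A minimal sketch (the socket path here is an assumption; "show info" is a standard command of HAProxy's CLI):

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"time"
)

// checkHAProxyAlive sends "show info" over the stats unix socket. The
// socket is not subject to the frontend maxconn limits, so it can answer
// even when the public frontends are saturated.
func checkHAProxyAlive(socketPath string, timeout time.Duration) error {
	conn, err := net.DialTimeout("unix", socketPath, timeout)
	if err != nil {
		return fmt.Errorf("cannot connect to stats socket: %v", err)
	}
	defer conn.Close()
	conn.SetDeadline(time.Now().Add(timeout))

	if _, err := fmt.Fprint(conn, "show info\n"); err != nil {
		return fmt.Errorf("cannot send command: %v", err)
	}
	// Any parsable reply is good enough to consider the process alive.
	line, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return fmt.Errorf("no reply from haproxy: %v", err)
	}
	fmt.Printf("haproxy replied: %q\n", line)
	return nil
}

func main() {
	// Example socket path; the real path depends on the router image.
	if err := checkHAProxyAlive("/var/lib/haproxy/run/haproxy.sock", time.Second); err != nil {
		fmt.Println("liveness check failed:", err)
	}
}
```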

@louyihua louyihua force-pushed the router-probe-fix branch 4 times, most recently from 21a766a to 4f49dce Compare February 16, 2017 07:34
@jeremyeder
Contributor

Perhaps the right thing to do is to experiment with patching haproxy to special-case /healthz and use that as a starting point for deeper discussions with haproxy upstream.

@louyihua
Contributor Author

@jeremyeder
I've experimented with HAProxy's monitor-uri for a while, and here is what I found:

  1. The monitor-uri (which defines /healthz) works on a per-frontend basis, which means each frontend can have its own failure condition, so there is NO single global health check for HAProxy.
  2. The maxconn limit is enforced at a VERY EARLY STAGE (actually, just after a socket is accepted/created), while the monitor-uri works at the HTTP level.
    This gap makes it very difficult to special-case whatever monitor-uri defines (like /healthz): if we want /healthz to bypass the limit, the check must be duplicated in many places, as /healthz is supported only in the HTTP frontend while the connection limit applies to every frontend. Such a change may also break what maxconn promises: HAProxy handles no more sockets than the number derived from maxconn.
    There may also be a deliberate design choice by HAProxy's author: a successful /healthz indicates the frontend is healthy, a failed /healthz indicates the frontend has failed, and a failed connection to /healthz indicates the frontend has exhausted its connection limit.
    There is a further consideration: if /healthz does not obey the default connection limit rule, should there be a special rule for it? If not, it becomes a weak point, as malicious clients can easily exhaust the server's resources by opening a large number of connections to /healthz. If so, it introduces more configuration complexity.
    Based on the above, I really doubt whether upstream will accept such a change (making /healthz bypass the connection limit rule). And I maintain that we should not use /healthz as the liveness probe, not only because it is difficult to make it bypass the connection limit, but also because it is only a per-frontend check rather than a global one, while a liveness probe SHOULD be a global check of the entire pod rather than of a single frontend.

@smarterclayton
Contributor

smarterclayton commented Feb 17, 2017 via email

@louyihua
Contributor Author

My preliminary proposal contains the following changes in the updated PR:

  1. Open an HTTP endpoint (configurable through a command line parameter or an environment variable) that listens for incoming health checks using http.ListenAndServe in the router controller, so it is inherited by all types of router.
  2. Add a HandleProbe method to the router plugin interface, so that each plugin (not just the underlying router implementation) can report its health state if necessary.
  3. For the current HAProxy router implementation, the liveness probe uses the stats socket (a unix socket not affected by the maxconn limit) and the readiness probe uses the /healthz endpoint. For the F5 router, it currently just returns OK for all probes, but more checks can be added if necessary. A rough sketch of this shape follows below.
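
To make the shape of points 1 and 2 concrete, here is a much-simplified sketch (the ProbePlugin name and the wiring are illustrative only, not the actual code in this PR):

```go
package main

import (
	"fmt"
	"net/http"
)

// ProbePlugin is an illustrative version of the proposed HandleProbe
// extension: any plugin in the chain may veto liveness or readiness.
type ProbePlugin interface {
	// HandleProbe returns nil if the plugin considers the given probe
	// path ("/alive" or "/healthz") to be passing.
	HandleProbe(path string) error
}

// probeHandler fans a probe request out to every registered plugin.
func probeHandler(plugins []ProbePlugin) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		for _, p := range plugins {
			if err := p.HandleProbe(r.URL.Path); err != nil {
				http.Error(w, err.Error(), http.StatusServiceUnavailable)
				return
			}
		}
		fmt.Fprintln(w, "ok")
	}
}

// alwaysOK mimics the F5 case in point 3: report OK for every probe.
type alwaysOK struct{}

func (alwaysOK) HandleProbe(string) error { return nil }

func main() {
	plugins := []ProbePlugin{alwaysOK{}}
	mux := http.NewServeMux()
	mux.HandleFunc("/alive", probeHandler(plugins))
	mux.HandleFunc("/healthz", probeHandler(plugins))
	// The address matches the proposed --probe-endpoint default.
	http.ListenAndServe("0.0.0.0:1935", mux)
}
```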

@jeremyeder
Contributor

jeremyeder commented Feb 20, 2017

@jmencak @openshift/networking PTAL?

Contributor

@pecameron pecameron left a comment


Is there an openshift-docs PR that corresponds with these changes?


.PP
\fB\-\-probe\-timeout\fP="1s"
The timeout that router waits for underlying implementation to reply a probe
Contributor


reply to a probe (add the "to")

Contributor Author


Thank you for the correction.

if cfg.HostNetwork {
probe.Handler.HTTPGet.Host = "localhost"
}
// Workaround for misconfigured environments where the Node's InternalIP is
Contributor


Is this information in the openshift-docs documentation? If it's new, what is the docs PR?

Contributor Author


You mean the comment here? It has been there for a while; this PR only changes its indentation.

probe.InitialDelaySeconds = 10
}
return probe
return generateProbeConfigForRouter(cfg, ports, "/alive", 10)
Contributor


Is a 10 sec delay sufficient? Should this be configurable?

Contributor Author


This is just the default value that has been used for a while.
If it does not suit some users, further customization can be done by editing the router's DC.

@@ -73,6 +77,8 @@ func (o *RouterSelection) Bind(flag *pflag.FlagSet) {
flag.BoolVar(&o.AllowWildcardRoutes, "allow-wildcard-routes", cmdutil.Env("ROUTER_ALLOW_WILDCARD_ROUTES", "") == "true", "Allow wildcard host names for routes")
flag.BoolVar(&o.DisableNamespaceOwnershipCheck, "disable-namespace-ownership-check", cmdutil.Env("ROUTER_DISABLE_NAMESPACE_OWNERSHIP_CHECK", "") == "true", "Disables the namespace ownership checks for a route host with different paths or for overlapping host names in the case of wildcard routes. Please be aware that if namespace ownership checks are disabled, routes in a different namespace can use this mechanism to 'steal' sub-paths for existing domains. This is only safe if route creation privileges are restricted, or if all the users can be trusted.")
flag.BoolVar(&o.EnableIngress, "enable-ingress", cmdutil.Env("ROUTER_ENABLE_INGRESS", "") == "true", "Enable configuration via ingress resources")
flag.StringVar(&o.ProbeEndpoint, "probe-endpoint", cmdutil.Env("ROUTER_PROBE_ENDPOINT", "0.0.0.0:1935"), "The http endpoint that router listens on for accepting incoming probes")
flag.StringVar(&o.ProbeTimeoutStr, "probe-timeout", cmdutil.Env("ROUTER_PROBE_TIMEOUT", "1s"), "The timeout that router waits for underlying implementation to reply a probe")
Contributor


reply "to" a probe

@eparis
Member

eparis commented Feb 21, 2017

I'm still not sure what I think about the whole idea. @smarterclayton can you give a minute of thought here to tell me I'm wrong?

We are going to have a probe which can pass even when the router is not doing its job. Its job, its one and only job, is to serve on port 80. The bug is that when the router is overworked and IS NOT SERVING on port 80, the probe fails.

Is the problem really the probe? If so, this PR makes sense to me.

Or is the problem how we REACT to the probe?

Is the real problem that when the router is overworked it gets killed and restarted, which may only compound the problem? Should we be looking for a better way to react?

Maybe the solution is somewhere in the middle... Vertical autoscaling under pressure? A probe which only checks the stats if port 80 is failing, and only then if the stats tell us that it is working?

This whole issue (not just this PR) really rubs me the wrong way, but I still haven't figured out 'the right way'. I just feel sure we are looking at it wrong.

@rajatchopra
Contributor

@eparis The problem is with the reaction to the probe. But the current probe does not tell us much - has the pod failed? No. Is the pod so overworked that it cannot even handle the probe? Yes.
How do we distinguish?

This PR does not do much to solve the reaction to the old probe response, but re-organizes the code such that we can write specific probe responses when we know what we want to do. The liveness probe is certainly better even if the readiness probe is the same old answer.

@pecameron
Contributor

@eparis what problem are we trying to fix here? It sounds like when the system gets resource-constrained, things slow down, so what is the real bottleneck?

At the least we should have a test that demonstrates that the fix works. Load haproxy until response times are long and verify that the health probe still returns quickly enough. Changes in the name of performance need very careful testing.

@louyihua
Contributor Author

@eparis The motivation for this fix comes from BUG 1405440, which points out that when the number of active connections reaches HAProxy's global connection limit, HAProxy also refuses to answer the /healthz endpoint, which causes the router pod to be repeatedly restarted because OpenShift uses this endpoint as the liveness probe. (For a liveness probe, if it fails, what else can we do but restart the pod?) However, if we just want to solve this problem itself, we have several options:

  1. We can raise HAProxy's global connection limit so that it won't be reached easily, or
  2. We can find a way to make this endpoint not obey the connection limit, or
  3. We can use different probes.

Option 1 is easy, but far from good. Although we can raise the limit to a very high number that seems unreachable even in extreme situations, the router cannot afford the side effect this brings: before the connection limit is exhausted, other system resource limits (memory, CPU, ...) are hit. When that happens, not only can new connections not come in, even existing connections may be affected (slow responses or even packet loss). I do not think this is what we want to see.
Option 2 seems good, but is very hard to implement. I've investigated HAProxy's code and found that making an HTTP endpoint bypass the connection limit rule requires breaking HAProxy's current code structure. And using /healthz as both the liveness and the readiness probe is not a good option, since, as @rajatchopra said, /healthz just returns true or false and cannot tell us what actually happened.
Since the first two options both have serious limitations, only option 3 is left. I first tried to propose a shell probe, but such a probe seemed no better than /healthz. Then I thought: why not let openshift-router provide the probe endpoint? In this way, we get a general probing mechanism that not only solves the above BUG, but also lets us do much more: for example, we can answer liveness and readiness probes using different criteria, fully adjustable according to our requirements.

And, @pecameron
I don't think we need to put so much load on HAProxy that it slows down. If HAProxy holds too much load, not only are new incoming connections affected, but existing connections may be affected as well, which is not what we want. What we need to do is decide a reasonable connection limit for HAProxy, such as the limit we would set in a real production environment, then make the number of active connections reach that limit, so that new incoming requests cannot be served while existing connections are not much affected. Then, in that situation, we test whether our probes (liveness & readiness) respond quickly and correctly.
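
Such a test could be scripted along these lines (a rough sketch; the addresses, probe port, and connection count are placeholders):

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

const (
	routerAddr = "10.0.0.1:80"                // placeholder: the router's public HTTP frontend
	probeURL   = "http://10.0.0.1:1935/alive" // placeholder: the proposed probe endpoint
	connLimit  = 2000                         // placeholder: the configured connection limit
)

func main() {
	// Hold open enough idle connections to pin HAProxy at its limit.
	conns := make([]net.Conn, 0, connLimit)
	for i := 0; i < connLimit; i++ {
		c, err := net.DialTimeout("tcp", routerAddr, time.Second)
		if err != nil {
			fmt.Printf("stopped at %d connections: %v\n", i, err)
			break
		}
		conns = append(conns, c)
	}
	defer func() {
		for _, c := range conns {
			c.Close()
		}
	}()

	// With the frontends saturated, the probe should still answer quickly.
	client := &http.Client{Timeout: 2 * time.Second}
	start := time.Now()
	resp, err := client.Get(probeURL)
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Printf("probe answered %d in %v\n", resp.StatusCode, time.Since(start))
}
```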

@ramr
Contributor

ramr commented Feb 22, 2017

Just adding my 2 cents here. The main aim of the liveness check is to verify that the haproxy process (or really the router pod) is alive. So from a certain perspective, connectivity to the haproxy process also signals liveness. Maybe the simpler approach is to have the liveness probe use TCPSocketAction and just verify connectivity to the stats port rather than try to send an HTTP request. We could still use the HTTP action/request for the readiness probe, but that is at startup (or close to it) time. At steady state, the tcp check is probably enough and a wee bit less invasive.
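
For comparison, the TCPSocketAction variant is just a different probe definition on the router pod. A minimal sketch using today's k8s.io/api/core/v1 types (the vendored API at the time used the older Handler field name, as the snippet earlier in this PR shows, and the stats port value here is an assumption):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// tcpLivenessProbe builds a liveness probe that only verifies TCP
// connectivity to the given port instead of issuing an HTTP request.
func tcpLivenessProbe(statsPort int) *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			TCPSocketAction: &corev1.TCPSocketAction{
				Port: intstr.FromInt(statsPort),
			},
		},
		InitialDelaySeconds: 10, // same default delay the existing router probes use
	}
}

func main() {
	_ = tcpLivenessProbe(1936) // 1936: the router's usual stats port
}
```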

@louyihua
Contributor Author

For HAProxy, since the connection limit is checked after the socket is accepted, a TCPSocketAction should succeed even when the connection limit is reached. So @ramr is right: using TCPSocketAction as the liveness probe is the easiest way for now.
But if we want to support other types of software router (like nginx) in the future, a more general, flexible, and implementation-independent probe may still be worth considering.

@openshift-bot openshift-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 26, 2017
@louyihua louyihua mentioned this pull request Feb 27, 2017
@openshift-bot openshift-bot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 20, 2017
@openshift-bot
Contributor

Origin Action Required: Pull request cannot be automatically merged, please rebase your branch from latest HEAD and push again

@openshift-bot openshift-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 25, 2017
@knobunc
Contributor

knobunc commented May 31, 2017

Closing this based on the previous conversation.

@knobunc knobunc closed this May 31, 2017
@louyihua louyihua deleted the router-probe-fix branch June 19, 2017 01:19
Labels: needs-rebase, priority/P2