
Fix of BUG 1405440 (release-1.5 cherry-pick) #13273

Merged

Conversation

@louyihua (Contributor) commented Mar 7, 2017

This uses a TCPSocketAction as the liveness probe, which is not affected by the connection limit set in HAProxy's config file. This is a true fix for BUG 1405440.

@pweil- @rajatchopra
Cherry-picked from PR #13121.
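For illustration only, here is a minimal sketch of what a TCPSocketAction liveness probe looks like. It is written against current k8s.io/api core/v1 types rather than the API vendored in origin at the time, and the port and timings are placeholders, not the values from this PR.

```go
package probes

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// tcpLivenessProbe builds a liveness probe that only requires the TCP
// handshake to complete. Unlike an HTTP check, it does not need HAProxy to
// serve a response, so it keeps succeeding when maxconn is reached.
func tcpLivenessProbe() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			TCPSocket: &corev1.TCPSocketAction{
				Port: intstr.FromInt(1936), // placeholder port, not the router's real one
			},
		},
		InitialDelaySeconds: 10, // placeholder timings
		TimeoutSeconds:      1,
	}
}
```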

@pweil- commented Mar 7, 2017

@knobunc @eparis please approve this pick. Looks like the referenced bug was high sev/prio and already verified in master.

@knobunc (Contributor) commented Mar 7, 2017

@pweil- Hmm. This merged while I was away, and I might have disputed the original merge. We fixed the original problem by raising the number of allowed connections (so I think the high-severity part is addressed).

This PR changes the semantics of the health check to just see whether the connection is accepted, rather than exercising anything more in haproxy. So, if haproxy is wedged elsewhere, we can't tell. @eparis, any thoughts?

@louyihua (Contributor, Author) commented Mar 7, 2017

@knobunc There was some discussion about the health check in #12846. And raising maxconn only reduces/delays the bug; it does not fix it.

In short, the currently used health check does not actually exercise any more of haproxy than this new check does. But the old check fails the pod when maxconn is reached, which drops all connections, while the new check still succeeds in that situation.

Reaching the connection limit does not mean the pod has failed; it does mean the pod is not ready for new incoming connections. So keeping the old check as the readiness check while changing the liveness check is a good way to actually solve the bug.

Maybe in the future we can use a better health check as the liveness probe, one that both keeps succeeding under high load and exercises more of haproxy. But for now, the TCP-connection-based check is a simple and effective way to solve the bug.
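As a sketch of that split (again assuming current core/v1 types, with a hypothetical health path and port), the readiness probe keeps the HTTP check so a saturated pod is pulled out of rotation, while the liveness probe only needs the TCP handshake so the pod is not restarted:

```go
package probes

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// setRouterProbes applies the proposed division of roles to a router container:
// readiness keeps the HTTP check (pod is marked unready while saturated),
// liveness only needs a TCP connect (pod is not killed at maxconn).
func setRouterProbes(c *corev1.Container) {
	c.ReadinessProbe = &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/healthz",           // hypothetical health endpoint
				Port: intstr.FromInt(1936), // hypothetical health port
			},
		},
	}
	c.LivenessProbe = &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			TCPSocket: &corev1.TCPSocketAction{
				Port: intstr.FromInt(1936),
			},
		},
	}
}
```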

@knobunc (Contributor) commented Mar 7, 2017

See also @eparis' concerns at https://bugzilla.redhat.com/show_bug.cgi?id=1405440#c6

What is the urgency for this particular backport? The change to maxconn is in, and there was some question on the original PR that this cherry-picks about how the new check would perform under load. Have we done that testing?

@louyihua (Contributor, Author) commented Mar 7, 2017

Regarding @eparis's concern: I've checked haproxy's code, and the "not responding" behavior is not what actually happens.

In fact, when the connection limit is reached, haproxy actively disconnects new incoming connections just after accepting the socket, so the HTTP health check definitely fails because no HTTP response is generated, but the TCP check should succeed because the socket connection really is accepted. Once concurrent connections drop below the maxconn limit, both the HTTP and the TCP health checks succeed.

As for testing: set maxconn to a very small value (such as 5 or 10), then simulate high load by using telnet to hold as many connections as maxconn allows; the difference between the two health checks then becomes visible. I have verified this on my own machine this way.
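A rough stand-in for that manual test, using only the Go standard library (the address, health path, and maxconn value are assumptions): hold maxconn idle connections, then run a TCP-style check and an HTTP-style check against the saturated proxy.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	const addr = "127.0.0.1:80" // hypothetical router address
	const maxconn = 5           // set the same small maxconn in haproxy.cfg

	// Hold maxconn idle connections so HAProxy hits its limit.
	var held []net.Conn
	for i := 0; i < maxconn; i++ {
		c, err := net.Dial("tcp", addr)
		if err != nil {
			fmt.Println("could not open filler connection:", err)
			return
		}
		held = append(held, c)
		defer c.Close()
	}

	// TCP-style check: only needs the handshake to complete.
	if c, err := net.DialTimeout("tcp", addr, 2*time.Second); err == nil {
		c.Close()
		fmt.Println("TCP check: success")
	} else {
		fmt.Println("TCP check: failure:", err)
	}

	// HTTP-style check: needs HAProxy to actually serve a response, which it
	// will not do while it is at maxconn.
	client := &http.Client{Timeout: 2 * time.Second}
	if resp, err := client.Get("http://" + addr + "/healthz"); err == nil {
		resp.Body.Close()
		fmt.Println("HTTP check: status", resp.StatusCode)
	} else {
		fmt.Println("HTTP check: failure:", err)
	}
}
```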

And if there really are so many connections that they exceed haproxy's capacity, the TCP check will also fail, but that is the right semantics: only in that situation should we consider haproxy unusable. That is also the case we should prevent, and it is why I disputed the commit that raises the connection limit. If we just raise the limit to keep the liveness probe from failing, rather than limiting the concurrent connections to a single router pod, we will eventually end up with more connections than haproxy can handle. However, if we keep maxconn at a reasonable value and use the new liveness probe, we not only prevent the pod from failing when the connection limit is reached, but also keep concurrent connections within the range that haproxy can handle well.

@knobunc (Contributor) commented Mar 7, 2017

@louyihua Ok, thanks! In that case my concerns are assuaged. [merge]

@openshift-bot (Contributor) commented Mar 7, 2017

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_future/909/) (Base Commit: 6fa29de) (Image: devenv-rhel7_6052)

@openshift-bot (Contributor)

Evaluated for origin merge up to 612dc51

@openshift-bot (Contributor)

[Test]ing while waiting on the merge queue

@openshift-bot (Contributor)

Evaluated for origin test up to 612dc51

@openshift-bot (Contributor)

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_future/909/) (Base Commit: 6fa29de)

@openshift-bot merged commit 161cc0a into openshift:release-1.5 on Mar 7, 2017
@louyihua deleted the bug-fix-1405440-cherry-5b708a5 branch on March 8, 2017 at 00:45