
Fix of BUG 1405440 (release-1.5 cherry-pick) #13273

Merged

Conversation

@louyihua (Contributor) commented Mar 7, 2017

This uses a TCPSocketAction as the liveness probe, which is not affected by the connection limit set in HAProxy's config file. This is a true fix for BUG 1405440.

@pweil- @rajatchopra
Cherry-picked from PR #13121.
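For illustration only, here is a minimal sketch of what a TCPSocketAction liveness probe looks like. It is written against current k8s.io/api core/v1 types rather than the API vendored in origin at the time, and the port and timings are placeholders, not the values from this PR.

```go
package probes

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// tcpLivenessProbe builds a liveness probe that only requires the TCP
// handshake to complete. Unlike an HTTP check, it does not need HAProxy to
// serve a response, so it keeps succeeding when maxconn is reached.
func tcpLivenessProbe() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			TCPSocket: &corev1.TCPSocketAction{
				Port: intstr.FromInt(1936), // placeholder port, not the router's real one
			},
		},
		InitialDelaySeconds: 10, // placeholder timings
		TimeoutSeconds:      1,
	}
}
```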

@pweil- commented Mar 7, 2017

@knobunc @eparis please approve this pick. Looks like the referenced bug was high sev/prio and already verified in master.

@knobunc (Contributor) commented Mar 7, 2017

@pweil- Hmm. This merged while I was away, and I might have disputed the original merge. We fixed the original problem by raising the number of allowed connections (so I think the high-severity part is addressed).

This PR changes the semantics of the health check to just see whether the connection is accepted, rather than exercising anything more in haproxy. So, if haproxy is wedged elsewhere, we can't tell. @eparis, any thoughts?

@louyihua (Contributor, Author) commented Mar 7, 2017

@knobunc There was some discussion about the health check in #12846. And raising maxconn only reduces/delays the bug; it does not fix it.

In short, the currently used health check does not actually exercise any more of haproxy than this new check does. But the old check fails the pod when maxconn is reached, which drops all connections, while the new check still succeeds in that situation.

Reaching the connection limit does not mean the pod has failed; it does mean the pod is not ready for new incoming connections. So keeping the old check as the readiness check while changing the liveness check is a good way to actually solve the bug.

Maybe in the future we can use a better health check as the liveness probe, one that both keeps succeeding under high load and exercises more of haproxy. But for now, the TCP-connection-based check is a simple and effective way to solve the bug.
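As a sketch of that split (again assuming current core/v1 types, with a hypothetical health path and port), the readiness probe keeps the HTTP check so a saturated pod is pulled out of rotation, while the liveness probe only needs the TCP handshake so the pod is not restarted:

```go
package probes

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// setRouterProbes applies the proposed division of roles to a router container:
// readiness keeps the HTTP check (pod is marked unready while saturated),
// liveness only needs a TCP connect (pod is not killed at maxconn).
func setRouterProbes(c *corev1.Container) {
	c.ReadinessProbe = &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/healthz",           // hypothetical health endpoint
				Port: intstr.FromInt(1936), // hypothetical health port
			},
		},
	}
	c.LivenessProbe = &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			TCPSocket: &corev1.TCPSocketAction{
				Port: intstr.FromInt(1936),
			},
		},
	}
}
```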

@knobunc (Contributor) commented Mar 7, 2017

See also @eparis' concerns at https://bugzilla.redhat.com/show_bug.cgi?id=1405440#c6

What is the urgency for this particular backport? The change to maxconn is in, and there was some question on the original PR that this cherry-picks about how the new check would perform under load. Have we done that testing?

@louyihua (Contributor, Author) commented Mar 7, 2017

Regarding @eparis's concern: I've checked haproxy's code, and the "not responding" behavior is not what actually happens.

In fact, when the connection limit is reached, haproxy actively disconnects new incoming connections just after accepting the socket, so the HTTP health check definitely fails because no HTTP response is generated, but the TCP check should succeed because the socket connection really is accepted. Once concurrent connections drop below the maxconn limit, both the HTTP and the TCP health checks succeed.

As for testing: set maxconn to a very small value (such as 5 or 10), then simulate high load by using telnet to hold as many connections as maxconn allows; the difference between the two health checks then becomes visible. I have verified this on my own machine this way.
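A rough stand-in for that manual test, using only the Go standard library (the address, health path, and maxconn value are assumptions): hold maxconn idle connections, then run a TCP-style check and an HTTP-style check against the saturated proxy.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	const addr = "127.0.0.1:80" // hypothetical router address
	const maxconn = 5           // set the same small maxconn in haproxy.cfg

	// Hold maxconn idle connections so HAProxy hits its limit.
	var held []net.Conn
	for i := 0; i < maxconn; i++ {
		c, err := net.Dial("tcp", addr)
		if err != nil {
			fmt.Println("could not open filler connection:", err)
			return
		}
		held = append(held, c)
		defer c.Close()
	}

	// TCP-style check: only needs the handshake to complete.
	if c, err := net.DialTimeout("tcp", addr, 2*time.Second); err == nil {
		c.Close()
		fmt.Println("TCP check: success")
	} else {
		fmt.Println("TCP check: failure:", err)
	}

	// HTTP-style check: needs HAProxy to actually serve a response, which it
	// will not do while it is at maxconn.
	client := &http.Client{Timeout: 2 * time.Second}
	if resp, err := client.Get("http://" + addr + "/healthz"); err == nil {
		resp.Body.Close()
		fmt.Println("HTTP check: status", resp.StatusCode)
	} else {
		fmt.Println("HTTP check: failure:", err)
	}
}
```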

And if there really are so many connections that they exceed haproxy's capacity, the TCP check will also fail, but that is the right semantics: only in that situation should we consider haproxy unusable. That is also the case we should prevent, and it is why I disputed the commit that raises the connection limit. If we just raise the limit to keep the liveness probe from failing, rather than limiting the concurrent connections to a single router pod, we will eventually end up with more connections than haproxy can handle. However, if we keep maxconn at a reasonable value and use the new liveness probe, we not only prevent the pod from failing when the connection limit is reached, but also keep concurrent connections within the range that haproxy can handle well.

@knobunc (Contributor) commented Mar 7, 2017

@louyihua Ok, thanks! In that case my concerns are assuaged. [merge]

@openshift-bot (Contributor) commented Mar 7, 2017

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_future/909/) (Base Commit: 6fa29de) (Image: devenv-rhel7_6052)

@openshift-bot (Contributor)

Evaluated for origin merge up to 612dc51

@openshift-bot (Contributor)

[Test]ing while waiting on the merge queue

@openshift-bot (Contributor)

Evaluated for origin test up to 612dc51

@openshift-bot (Contributor)

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_future/909/) (Base Commit: 6fa29de)

@openshift-bot merged commit 161cc0a into openshift:release-1.5 on Mar 7, 2017
@louyihua deleted the bug-fix-1405440-cherry-5b708a5 branch on March 8, 2017 at 00:45