Make haproxy maxconn configurable #12716

pecameron · 2017-01-30T18:44:25Z

The haproxy max_conn (maximum connnections) is by default 2000. This
change make that configurable through the
oadm router --max-connections= option when creating a router and through
changing the ROUTER_MAX_CONNECTIONS environment variable in an existing
router.

bug 1405440
https://bugzilla.redhat.com/show_bug.cgi?id=1405440

pecameron · 2017-01-30T18:45:15Z

@knobunc PTAL

knobunc · 2017-01-30T21:24:48Z

[test]

knobunc

LGTM

pecameron · 2017-01-31T13:59:12Z

[test]

jeremyeder · 2017-01-31T15:23:33Z

awesome, thank you very much for this. /cc @jmencak

pecameron · 2017-01-31T18:36:46Z

@ncdc @knobunc and I have looked at this test failure and can't find the cause. Would you be willing to take a look?

ncdc · 2017-01-31T19:11:30Z

@pecameron @knobunc do you mean why this failed?

Running test/end-to-end/core.sh:505: executing 'curl --max-time 2 --fail --silent 'http://172.18.13.218:1936/healthz'' expecting success; re-trying every 0.2s until completion or 300.000s...
FAILURE after 299.792s: test/end-to-end/core.sh:505: executing 'curl --max-time 2 --fail --silent 'http://172.18.13.218:1936/healthz'' expecting success; re-trying every 0.2s until completion or 300.000s: the command timed out
Standard output from the command:
Standard error from the command:
[ERROR] PID 17693: hack/lib/cmd.sh:532: `return "${return_code}"` exited with status 1.
[INFO] 		Stack Trace: 
[INFO] 		  1: hack/lib/cmd.sh:532: `return "${return_code}"`
[INFO] 		  2: test/end-to-end/core.sh:505: os::cmd::try_until_success
[INFO]   Exiting with code 1.
[ERROR] PID 16421: hack/test-end-to-end-docker.sh:91: `${OS_ROOT}/test/end-to-end/core.sh` exited with status 1.

pecameron · 2017-01-31T19:19:46Z

@ncdc Yes that is the issue. It has happened twice so far.

ncdc · 2017-01-31T19:25:55Z

The origin container log indicates the pod wasn't healthy. I'd recommend you dig in there to see if you can find out why.

pecameron · 2017-01-31T19:37:17Z

@ndnc @knobunc @ramr The pod is the haproxy router. I have made a change in the template and resulting config file (added maxconn 20000). The changes work on my lab machines but not here. Do you know if port 1936 is open in iptables (healthz uses it)? Also, do you know where the router is set up? Is there a dump of the environment variables in one of the log files? We have looked at the usual suspects and can't get to the bottom of this.

ncdc · 2017-01-31T19:39:38Z

Spin up a devenv in EC2 and test it there? You should be able to do whatever sort of debugging you want in that env.

…

On Tue, Jan 31, 2017 at 2:37 PM, Phil Cameron ***@***.***> wrote: @ndnc @knobunc <https://github.com/knobunc> @ramr <https://github.com/ramr> The pod is the haproxy router. I have made a change in the template and resulting config file (added maxconn 20000). The changes work on my lab machines but not here. Do you know if port 1936 is open in iptables (healthz uses it)? Also, do you know where the router is set up? Is there a dump of the environment variables in one of the log files? We have looked at the usual suspects and can't get to the bottom of this. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#12716 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAABYqSyaKfdY-Ku7ci__qJtwa_Ekgsfks5rX41ygaJpZM4LxxDr> .

pecameron · 2017-02-01T16:04:06Z

[test]

pecameron · 2017-02-01T20:28:46Z

@ramr @knobunc @rajatchopra I need some help with this. I added maxconn 20000 to the haproxy config file in the global and default sections (where the commented maxconn 4096 was coded) and the tests fail the same way in the same spot. When I remove the maxcon 20000 the tests pass. I can't find out what is causing the failure. Do any of you have any idea what is happening? I am trying to get this reproduced on a EC2 cluster but no luck yet. This works on my lab cluster so I think it has something to do with EC2.

Thanks!
phil

rajatchopra · 2017-02-01T20:44:18Z

@pecameron Likely a sysctl limit.
Check 'sysctl fs.nr_open' and 'sysctl fs.file-max'. They should be sufficient to allow a maxconn of 20000, so say 1m at least. Check with the dev machine where the tests seem to pass.

pecameron · 2017-02-01T20:58:15Z

@rajatchopra Thanks. @ncdc how can we find the limits for 'sysctl fs.nr_open' and 'sysctl fs.file-max' on the test system? Is it in one of the log files?

ramr

LGTM

pecameron · 2017-02-02T18:29:22Z

@openshift/cli-review - new ipfailover option --max-connections PTAL
Also, openshift-docs PR 3609

pecameron · 2017-02-02T18:29:35Z

[test]

knobunc · 2017-02-02T18:32:35Z

images/router/haproxy/conf/haproxy-config.template

@@ -6,7 +6,8 @@
 {{ define "/var/lib/haproxy/conf/haproxy.config" }}
 {{ $workingDir := .WorkingDir }}
 global
-  # maxconn 4096
+  maxconn {{env "ROUTER_MAX_CONNECTIONS" "20000"}}


This will break existing routers... please make the default 2k as well.

Why would it break existing routers?

Also, while supporting this is ok for now, I don't think this is a value that should be hardcoded in the long run. The router ultimately needs to make a determination about how many connections to allow based on other factors. Arbitrary constants that don't work for every cluster are a recipe for a break at the worst time.

At the time I made the comment, the theory about why the tests were breaking was that the AWS file handle limits were not allowing 200k. We had dropped everything to 2k in oadm, so this was defaulting the existing routers to a higher number.

We found that the AWS limit was huge and not causing the problem (rather it was a default coming from 'oc cluster up' that was not set, so setting maxconns to 0). So this comment is stale.

@smarterclayton a maxconns=auto type thing would be an RFE against haproxy. The idea makes sense to me, I wonder if it's been discussed upstream before.

Edit:

From the docs: http://cbonte.github.io/haproxy-dconv/1.7/configuration.html#maxconn

This is new in 1.7:

snip:

If this value is not set, it will default to the value set in DEFAULT_MAXCONN at build time (reported in haproxy -vv) if no memory limit is enforced, or will be computed based on the memory limit, the buffer size, memory allocated to compression, SSL cache size, and use or not of SSL and the associated maxsslconn (which can also be automatic).```

knobunc · 2017-02-02T18:32:43Z

images/router/haproxy/conf/haproxy-config.template

@@ -33,7 +34,8 @@ global
  # ssl-default-bind-ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:ECDHE-RSA-DES-CBC3-SHA:ECDHE-ECDSA-DES-CBC3-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:AES:DES-CBC3-SHA:HIGH:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!aECDH:!EDH-DSS-DES-CBC3-SHA:!EDH-RSA-DES-CBC3-SHA:!KRB5-DES-CBC3-SHA

 defaults
-  # maxconn 4096
+  maxconn {{env "ROUTER_MAX_CONNECTIONS" "20000"}}


knobunc · 2017-02-02T18:33:21Z

@openshift/cli-review PTAL

pecameron · 2017-02-03T13:31:04Z

@knobunc fix worked. Anything else needed before merge?

knobunc · 2017-02-03T14:45:47Z

pkg/bootstrap/docker/openshift/admin.go

@@ -157,6 +157,7 @@ func (h *Helper) InstallRouter(kubeClient kclientset.Interface, f *clientcmd.Fac
 		HostNetwork:        !portForwarding,
 		HostPorts:          true,
 		ServiceAccount:     "router",
+		MaxConnections:     2000,


I thought about this more last night and I think this is the wrong approach. I think we make a maxconns of 0 mean 'use the default'. It makes no sense for 0 to be set ever. So let's not change this file, but handle it down below in router.go.

knobunc · 2017-02-03T14:46:11Z

pkg/cmd/admin/router/router.go

@@ -232,6 +236,9 @@ const (
 	// Default stats and healthz port.
 	defaultStatsPort   = 1936
 	defaultHealthzPort = defaultStatsPort
+
+	// Default maximum number of connections
+	defaultMaxConnections = 2000


And let's go back to 20,000 since the problem was not the 2000

pecameron · 2017-02-03T15:30:22Z

@knobunc OK changing default to 20000. --max-connections=0 changed to default (20000)

knobunc · 2017-02-03T19:30:05Z

images/router/haproxy/conf/haproxy-config.template

@@ -6,7 +6,8 @@
 {{ define "/var/lib/haproxy/conf/haproxy.config" }}
 {{ $workingDir := .WorkingDir }}
 global
-  # maxconn 4096
+  maxconn {{env "ROUTER_MAX_CONNECTIONS" "2000"}}


20000 please

knobunc · 2017-02-03T19:30:09Z

images/router/haproxy/conf/haproxy-config.template

@@ -33,7 +34,8 @@ global
  # ssl-default-bind-ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:ECDHE-RSA-DES-CBC3-SHA:ECDHE-ECDSA-DES-CBC3-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:AES:DES-CBC3-SHA:HIGH:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!aECDH:!EDH-DSS-DES-CBC3-SHA:!EDH-RSA-DES-CBC3-SHA:!KRB5-DES-CBC3-SHA

 defaults
-  # maxconn 4096
+  maxconn {{env "ROUTER_MAX_CONNECTIONS" "2000"}}


pecameron · 2017-02-03T19:42:11Z

@knobunc fixed

smarterclayton · 2017-02-03T20:13:40Z

pkg/cmd/admin/router/router.go

@@ -250,6 +257,8 @@ func NewCmdRouter(f *clientcmd.Factory, parentName, name string, out, errout io.
 		StatsPort:     defaultStatsPort,
 		HostNetwork:   true,
 		HostPorts:     true,
+
+		MaxConnections: defaultMaxConnections,


Don't set a default on the client - the router should set that, not here. The default value here should be ""

Basically, client code that has server side components should never impose defaults - the server should do that. Same as with validation. Client side API logic is not appropriate.

pecameron · 2017-02-03T21:49:21Z

@smarterclayton changes are in PTAL
Is there anything else you would like changed?

smarterclayton · 2017-02-03T22:22:40Z

pkg/cmd/infra/router/template.go

@@ -91,6 +92,7 @@ func (o *TemplateRouter) Bind(flag *pflag.FlagSet) {
 	flag.DurationVar(&o.ReloadInterval, "interval", reloadInterval(), "Controls how often router reloads are invoked. Mutiple router reload requests are coalesced for the duration of this interval since the last reload time.")
 	flag.BoolVar(&o.ExtendedValidation, "extended-validation", util.Env("EXTENDED_VALIDATION", "true") == "true", "If set, then an additional extended validation step is performed on all routes admitted in by this router. Defaults to true and enables the extended validation checks.")
 	flag.BoolVar(&o.BindPortsAfterSync, "bind-ports-after-sync", util.Env("ROUTER_BIND_PORTS_AFTER_SYNC", "") == "true", "Bind ports only after route state has been synchronized")
+	flag.StringVar(&o.MaxConnections, "max-connections", util.Env("ROUTER_MAX_CONNECTIONS", ""), "Specifies the maximum number of concurrent connections. Default 20000")


I don't think Default 20000 is appropriate - that's only the default for newer images that we provide.

pecameron · 2017-02-06T15:01:06Z

@smarterclayton @knobunc this has in successive commits been 2000, then 20000, then 2000, then 20000 (literally!)
The way this is set up, everyone that runs the 3.5 image will get 20000, unless otherwise specified. If they haven't bumped into the 2000 limit in 3.4 they will not notice the increased headroom.

pecameron · 2017-02-06T15:06:23Z

On 02/03/2017 05:23 PM, Clayton Coleman wrote: ***@***.**** commented on this pull request. ------------------------------------------------------------------------ In images/router/haproxy/conf/haproxy-config.template <#12716>: > @@ -6,7 +6,8 @@ {{ define "/var/lib/haproxy/conf/haproxy.config" }} {{ $workingDir := .WorkingDir }} global - # maxconn 4096 + maxconn {{env "ROUTER_MAX_CONNECTIONS" "20000"}} Why would it break existing routers? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#12716>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ANUgekIMG9rN7u-CBkmJjRmdGV88H3w_ks5rY6jigaJpZM4LxxDr>.

Not sure what you mean here. This just allows for more connections. It doesn't break anything. The default of 2000 is really too low for todays servers.

pecameron · 2017-02-06T15:07:48Z

On 02/03/2017 05:24 PM, Clayton Coleman wrote: ***@***.**** commented on this pull request. ------------------------------------------------------------------------ In images/router/haproxy/conf/haproxy-config.template <#12716>: > @@ -6,7 +6,8 @@ {{ define "/var/lib/haproxy/conf/haproxy.config" }} {{ $workingDir := .WorkingDir }} global - # maxconn 4096 + maxconn {{env "ROUTER_MAX_CONNECTIONS" "20000"}} Also, while supporting this is ok for now, I don't think this is a value that should be hardcoded in the long run. The router ultimately needs to make a determination about how many connections to allow based on other factors. Arbitrary constants that don't work for /every/ cluster are a recipe for a break at the worst time. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#12716>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ANUgevqln6rEk8Wyrfaocy2FRvnLWMyoks5rY6kZgaJpZM4LxxDr>.

The value of 2000 is hard coded in haproxy. It is really too low.

The haproxy maxconn (maximum connections) is by default 20000. This was previously 2000. This change makes that configurable through the oadm router --max-connections= option when creating a router. For existing routers the value can be set in the ROUTER_MAX_CONNECTIONS environment variable. If ROUTER_MAX_CONNECTIONS is missing, the deafult (20000) is used. openshift-docs PR 3609 bug 1405440 https://bugzilla.redhat.com/show_bug.cgi?id=1405440

openshift-bot · 2017-02-06T16:29:24Z

Evaluated for origin test up to 119f9b1

knobunc · 2017-02-06T16:47:15Z

@smarterclayton would you like to take another pass? Now 'oadm router' only sets the maxconns environment variable if passed. The template default (when the env is not set) is 20,000.

openshift-bot · 2017-02-06T17:50:45Z

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/13616/) (Base Commit: 6f30f7e)

knobunc · 2017-02-06T18:32:55Z

[merge]

openshift-bot · 2017-02-06T18:37:33Z

Evaluated for origin merge up to 119f9b1

smarterclayton · 2017-02-06T19:46:49Z

changes LGTM

openshift-bot · 2017-02-07T02:45:33Z

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/13641/) (Base Commit: c955745) (Image: devenv-rhel7_5865)

louyihua · 2017-02-07T03:35:28Z

My opinion, it is good to make haproxy's maxconn configurable.
But, I don't think just increasing the default value is a good way to solve the bug.
Instead, since haproxy cannot distinguish a normal request and a stats request, it may be better to use antoher way to probe pod's liveness (for example, a shell script) rather than using the HTTP get method.

jeremyeder · 2017-02-07T12:16:38Z

@louyihua agreed that this is only one approach to the actual fix. We did propose a different method of probing using scripts and connecting to the haproxy socket, but I think @smarterclayton thought this was fragile? We have it working, and it does solve the situation.

Anyway it looks like in newer versions of haproxy there is a way to make maxconn dynamic. That needs investigation (I just dug it up yesterday), and seems like the right way to go.

Plus, as of 3.4 we have an easy way for users to overwrite the haproxy config using a configmap. And then you also have this PR, which makes maxconn configurable at the time of deployment.

The change from 2000->20000 brings haproxy config in-line with how it's deployed in Red Hat's OpenStack product. It also is as I said, a safe default for us to improve OOTB experience of OCP for a large swath of customers (2000 is very, very easy to hit).

We posed our situation to the upstream list; discussed here: https://www.mail-archive.com/[email protected]/msg24553.html

I'll go on-record one last time about this. I think access to the healthz endpoint should not be counted against maxconn.

pecameron force-pushed the bz1405440 branch from ab89339 to edd9f9b Compare January 30, 2017 21:06

pecameron changed the title ~~Make haproxy max_conn configurable~~ Make haproxy maxconn configurable Jan 30, 2017

knobunc approved these changes Jan 30, 2017

View reviewed changes

jeremyeder mentioned this pull request Jan 31, 2017

Plumb new router max connections toggle up through ansible inventory openshift/openshift-ansible#3233

Closed

pecameron force-pushed the bz1405440 branch from edd9f9b to 8f1ac0c Compare February 1, 2017 14:07

pecameron force-pushed the bz1405440 branch from 8f1ac0c to 5b6f48b Compare February 1, 2017 18:10

pecameron force-pushed the bz1405440 branch 2 times, most recently from 5507337 to bdb9aa5 Compare February 1, 2017 21:31

ramr approved these changes Feb 1, 2017

View reviewed changes

pecameron force-pushed the bz1405440 branch 2 times, most recently from 2dca65c to 73ea351 Compare February 2, 2017 16:52

knobunc requested changes Feb 2, 2017

View reviewed changes

pecameron force-pushed the bz1405440 branch from 73ea351 to 0693784 Compare February 2, 2017 18:36

pecameron force-pushed the bz1405440 branch from 5577968 to 0851cb6 Compare February 2, 2017 21:49

knobunc requested changes Feb 3, 2017

View reviewed changes

pecameron force-pushed the bz1405440 branch from 0851cb6 to f2bd910 Compare February 3, 2017 16:04

knobunc requested changes Feb 3, 2017

View reviewed changes

pecameron force-pushed the bz1405440 branch from f2bd910 to 00832b1 Compare February 3, 2017 19:41

knobunc approved these changes Feb 3, 2017

View reviewed changes

smarterclayton suggested changes Feb 3, 2017

View reviewed changes

pecameron force-pushed the bz1405440 branch from 00832b1 to 4f6d736 Compare February 3, 2017 21:48

smarterclayton reviewed Feb 3, 2017

View reviewed changes

pecameron force-pushed the bz1405440 branch from 4f6d736 to 119f9b1 Compare February 6, 2017 16:21

openshift-bot merged commit 8e1be6c into openshift:master Feb 7, 2017

pecameron deleted the bz1405440 branch February 7, 2017 13:38

louyihua mentioned this pull request Feb 8, 2017

A new liveness probe for router pod #12846

Closed

Make haproxy maxconn configurable #12716

Make haproxy maxconn configurable #12716

Conversation

pecameron commented Jan 30, 2017

pecameron commented Jan 30, 2017

knobunc commented Jan 30, 2017

knobunc left a comment

Choose a reason for hiding this comment

pecameron commented Jan 31, 2017

jeremyeder commented Jan 31, 2017

pecameron commented Jan 31, 2017

ncdc commented Jan 31, 2017

pecameron commented Jan 31, 2017

ncdc commented Jan 31, 2017

pecameron commented Jan 31, 2017

ncdc commented Jan 31, 2017 via email

pecameron commented Feb 1, 2017

pecameron commented Feb 1, 2017

rajatchopra commented Feb 1, 2017

pecameron commented Feb 1, 2017

ramr left a comment

Choose a reason for hiding this comment

pecameron commented Feb 2, 2017

pecameron commented Feb 2, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jeremyeder Feb 6, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

knobunc commented Feb 2, 2017

pecameron commented Feb 3, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pecameron commented Feb 3, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pecameron commented Feb 3, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pecameron commented Feb 3, 2017

Choose a reason for hiding this comment

pecameron commented Feb 6, 2017

pecameron commented Feb 6, 2017 via email

pecameron commented Feb 6, 2017 via email

openshift-bot commented Feb 6, 2017

knobunc commented Feb 6, 2017

openshift-bot commented Feb 6, 2017

knobunc commented Feb 6, 2017

openshift-bot commented Feb 6, 2017

smarterclayton commented Feb 6, 2017

openshift-bot commented Feb 7, 2017 • edited Loading

louyihua commented Feb 7, 2017

jeremyeder commented Feb 7, 2017

jeremyeder Feb 6, 2017 •

edited

Loading

openshift-bot commented Feb 7, 2017 •

edited

Loading