Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make haproxy maxconn configurable #12716

Merged
merged 1 commit into from
Feb 7, 2017

Conversation

pecameron
Copy link
Contributor

The haproxy max_conn (maximum connnections) is by default 2000. This
change make that configurable through the
oadm router --max-connections= option when creating a router and through
changing the ROUTER_MAX_CONNECTIONS environment variable in an existing
router.

bug 1405440
https://bugzilla.redhat.com/show_bug.cgi?id=1405440

@pecameron
Copy link
Contributor Author

@knobunc PTAL

@pecameron pecameron changed the title Make haproxy max_conn configurable Make haproxy maxconn configurable Jan 30, 2017
@knobunc
Copy link
Contributor

knobunc commented Jan 30, 2017

[test]

Copy link
Contributor

@knobunc knobunc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@pecameron
Copy link
Contributor Author

[test]

@jeremyeder
Copy link
Contributor

awesome, thank you very much for this. /cc @jmencak

@pecameron
Copy link
Contributor Author

@ncdc @knobunc and I have looked at this test failure and can't find the cause. Would you be willing to take a look?

@ncdc
Copy link
Contributor

ncdc commented Jan 31, 2017

@pecameron @knobunc do you mean why this failed?

Running test/end-to-end/core.sh:505: executing 'curl --max-time 2 --fail --silent 'http://172.18.13.218:1936/healthz'' expecting success; re-trying every 0.2s until completion or 300.000s...
FAILURE after 299.792s: test/end-to-end/core.sh:505: executing 'curl --max-time 2 --fail --silent 'http://172.18.13.218:1936/healthz'' expecting success; re-trying every 0.2s until completion or 300.000s: the command timed out
Standard output from the command:
Standard error from the command:
[ERROR] PID 17693: hack/lib/cmd.sh:532: `return "${return_code}"` exited with status 1.
[INFO] 		Stack Trace: 
[INFO] 		  1: hack/lib/cmd.sh:532: `return "${return_code}"`
[INFO] 		  2: test/end-to-end/core.sh:505: os::cmd::try_until_success
[INFO]   Exiting with code 1.
[ERROR] PID 16421: hack/test-end-to-end-docker.sh:91: `${OS_ROOT}/test/end-to-end/core.sh` exited with status 1.

@pecameron
Copy link
Contributor Author

@ncdc Yes that is the issue. It has happened twice so far.

@ncdc
Copy link
Contributor

ncdc commented Jan 31, 2017

The origin container log indicates the pod wasn't healthy. I'd recommend you dig in there to see if you can find out why.

@pecameron
Copy link
Contributor Author

@ndnc @knobunc @ramr The pod is the haproxy router. I have made a change in the template and resulting config file (added maxconn 20000). The changes work on my lab machines but not here. Do you know if port 1936 is open in iptables (healthz uses it)? Also, do you know where the router is set up? Is there a dump of the environment variables in one of the log files? We have looked at the usual suspects and can't get to the bottom of this.

@ncdc
Copy link
Contributor

ncdc commented Jan 31, 2017 via email

@pecameron
Copy link
Contributor Author

[test]

@pecameron
Copy link
Contributor Author

@ramr @knobunc @rajatchopra I need some help with this. I added maxconn 20000 to the haproxy config file in the global and default sections (where the commented maxconn 4096 was coded) and the tests fail the same way in the same spot. When I remove the maxcon 20000 the tests pass. I can't find out what is causing the failure. Do any of you have any idea what is happening? I am trying to get this reproduced on a EC2 cluster but no luck yet. This works on my lab cluster so I think it has something to do with EC2.

Thanks!
phil

@rajatchopra
Copy link
Contributor

@pecameron Likely a sysctl limit.
Check 'sysctl fs.nr_open' and 'sysctl fs.file-max'. They should be sufficient to allow a maxconn of 20000, so say 1m at least. Check with the dev machine where the tests seem to pass.

@pecameron
Copy link
Contributor Author

@rajatchopra Thanks. @ncdc how can we find the limits for 'sysctl fs.nr_open' and 'sysctl fs.file-max' on the test system? Is it in one of the log files?

@pecameron pecameron force-pushed the bz1405440 branch 2 times, most recently from 5507337 to bdb9aa5 Compare February 1, 2017 21:31
Copy link
Contributor

@ramr ramr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@pecameron pecameron force-pushed the bz1405440 branch 2 times, most recently from 2dca65c to 73ea351 Compare February 2, 2017 16:52
@pecameron
Copy link
Contributor Author

@openshift/cli-review - new ipfailover option --max-connections PTAL
Also, openshift-docs PR 3609

@pecameron
Copy link
Contributor Author

[test]

@@ -6,7 +6,8 @@
{{ define "/var/lib/haproxy/conf/haproxy.config" }}
{{ $workingDir := .WorkingDir }}
global
# maxconn 4096
maxconn {{env "ROUTER_MAX_CONNECTIONS" "20000"}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will break existing routers... please make the default 2k as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would it break existing routers?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, while supporting this is ok for now, I don't think this is a value that should be hardcoded in the long run. The router ultimately needs to make a determination about how many connections to allow based on other factors. Arbitrary constants that don't work for every cluster are a recipe for a break at the worst time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the time I made the comment, the theory about why the tests were breaking was that the AWS file handle limits were not allowing 200k. We had dropped everything to 2k in oadm, so this was defaulting the existing routers to a higher number.

We found that the AWS limit was huge and not causing the problem (rather it was a default coming from 'oc cluster up' that was not set, so setting maxconns to 0). So this comment is stale.

Copy link
Contributor

@jeremyeder jeremyeder Feb 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smarterclayton a maxconns=auto type thing would be an RFE against haproxy. The idea makes sense to me, I wonder if it's been discussed upstream before.

Edit:

From the docs: http://cbonte.github.io/haproxy-dconv/1.7/configuration.html#maxconn

This is new in 1.7:

snip:

If this value is not set, it will default to the value
set in DEFAULT_MAXCONN at build time (reported in haproxy -vv) if no memory
limit is enforced, or will be computed based on the memory limit, the buffer
size, memory allocated to compression, SSL cache size, and use or not of SSL
and the associated maxsslconn (which can also be automatic).```

@@ -33,7 +34,8 @@ global
# ssl-default-bind-ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:ECDHE-RSA-DES-CBC3-SHA:ECDHE-ECDSA-DES-CBC3-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:AES:DES-CBC3-SHA:HIGH:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!aECDH:!EDH-DSS-DES-CBC3-SHA:!EDH-RSA-DES-CBC3-SHA:!KRB5-DES-CBC3-SHA

defaults
# maxconn 4096
maxconn {{env "ROUTER_MAX_CONNECTIONS" "20000"}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

@knobunc
Copy link
Contributor

knobunc commented Feb 2, 2017

@openshift/cli-review PTAL

@pecameron
Copy link
Contributor Author

@knobunc fix worked. Anything else needed before merge?

@@ -157,6 +157,7 @@ func (h *Helper) InstallRouter(kubeClient kclientset.Interface, f *clientcmd.Fac
HostNetwork: !portForwarding,
HostPorts: true,
ServiceAccount: "router",
MaxConnections: 2000,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this more last night and I think this is the wrong approach. I think we make a maxconns of 0 mean 'use the default'. It makes no sense for 0 to be set ever. So let's not change this file, but handle it down below in router.go.

@@ -232,6 +236,9 @@ const (
// Default stats and healthz port.
defaultStatsPort = 1936
defaultHealthzPort = defaultStatsPort

// Default maximum number of connections
defaultMaxConnections = 2000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And let's go back to 20,000 since the problem was not the 2000

@pecameron
Copy link
Contributor Author

@knobunc OK changing default to 20000. --max-connections=0 changed to default (20000)

@@ -6,7 +6,8 @@
{{ define "/var/lib/haproxy/conf/haproxy.config" }}
{{ $workingDir := .WorkingDir }}
global
# maxconn 4096
maxconn {{env "ROUTER_MAX_CONNECTIONS" "2000"}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

20000 please

@@ -33,7 +34,8 @@ global
# ssl-default-bind-ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:ECDHE-RSA-DES-CBC3-SHA:ECDHE-ECDSA-DES-CBC3-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:AES:DES-CBC3-SHA:HIGH:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!aECDH:!EDH-DSS-DES-CBC3-SHA:!EDH-RSA-DES-CBC3-SHA:!KRB5-DES-CBC3-SHA

defaults
# maxconn 4096
maxconn {{env "ROUTER_MAX_CONNECTIONS" "2000"}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@pecameron
Copy link
Contributor Author

@knobunc fixed

@@ -250,6 +257,8 @@ func NewCmdRouter(f *clientcmd.Factory, parentName, name string, out, errout io.
StatsPort: defaultStatsPort,
HostNetwork: true,
HostPorts: true,

MaxConnections: defaultMaxConnections,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't set a default on the client - the router should set that, not here. The default value here should be ""

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, client code that has server side components should never impose defaults - the server should do that. Same as with validation. Client side API logic is not appropriate.

@pecameron
Copy link
Contributor Author

@smarterclayton changes are in PTAL
Is there anything else you would like changed?

@@ -91,6 +92,7 @@ func (o *TemplateRouter) Bind(flag *pflag.FlagSet) {
flag.DurationVar(&o.ReloadInterval, "interval", reloadInterval(), "Controls how often router reloads are invoked. Mutiple router reload requests are coalesced for the duration of this interval since the last reload time.")
flag.BoolVar(&o.ExtendedValidation, "extended-validation", util.Env("EXTENDED_VALIDATION", "true") == "true", "If set, then an additional extended validation step is performed on all routes admitted in by this router. Defaults to true and enables the extended validation checks.")
flag.BoolVar(&o.BindPortsAfterSync, "bind-ports-after-sync", util.Env("ROUTER_BIND_PORTS_AFTER_SYNC", "") == "true", "Bind ports only after route state has been synchronized")
flag.StringVar(&o.MaxConnections, "max-connections", util.Env("ROUTER_MAX_CONNECTIONS", ""), "Specifies the maximum number of concurrent connections. Default 20000")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think Default 20000 is appropriate - that's only the default for newer images that we provide.

@pecameron
Copy link
Contributor Author

@smarterclayton @knobunc this has in successive commits been 2000, then 20000, then 2000, then 20000 (literally!)
The way this is set up, everyone that runs the 3.5 image will get 20000, unless otherwise specified. If they haven't bumped into the 2000 limit in 3.4 they will not notice the increased headroom.

@pecameron
Copy link
Contributor Author

pecameron commented Feb 6, 2017 via email

@pecameron
Copy link
Contributor Author

pecameron commented Feb 6, 2017 via email

The haproxy maxconn (maximum connections) is by default 20000. This
was previously 2000. This change makes that configurable through the
oadm router --max-connections= option when creating a router.  For
existing routers the value can be set in the ROUTER_MAX_CONNECTIONS
environment variable. If ROUTER_MAX_CONNECTIONS is missing, the
deafult (20000) is used.

openshift-docs PR 3609

bug 1405440
https://bugzilla.redhat.com/show_bug.cgi?id=1405440
@openshift-bot
Copy link
Contributor

Evaluated for origin test up to 119f9b1

@knobunc
Copy link
Contributor

knobunc commented Feb 6, 2017

@smarterclayton would you like to take another pass? Now 'oadm router' only sets the maxconns environment variable if passed. The template default (when the env is not set) is 20,000.

@openshift-bot
Copy link
Contributor

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/13616/) (Base Commit: 6f30f7e)

@knobunc
Copy link
Contributor

knobunc commented Feb 6, 2017

[merge]

@openshift-bot
Copy link
Contributor

Evaluated for origin merge up to 119f9b1

@smarterclayton
Copy link
Contributor

changes LGTM

@openshift-bot
Copy link
Contributor

openshift-bot commented Feb 7, 2017

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/13641/) (Base Commit: c955745) (Image: devenv-rhel7_5865)

@louyihua
Copy link
Contributor

louyihua commented Feb 7, 2017

My opinion, it is good to make haproxy's maxconn configurable.
But, I don't think just increasing the default value is a good way to solve the bug.
Instead, since haproxy cannot distinguish a normal request and a stats request, it may be better to use antoher way to probe pod's liveness (for example, a shell script) rather than using the HTTP get method.

@openshift-bot openshift-bot merged commit 8e1be6c into openshift:master Feb 7, 2017
@jeremyeder
Copy link
Contributor

@louyihua agreed that this is only one approach to the actual fix. We did propose a different method of probing using scripts and connecting to the haproxy socket, but I think @smarterclayton thought this was fragile? We have it working, and it does solve the situation.

Anyway it looks like in newer versions of haproxy there is a way to make maxconn dynamic. That needs investigation (I just dug it up yesterday), and seems like the right way to go.

Plus, as of 3.4 we have an easy way for users to overwrite the haproxy config using a configmap. And then you also have this PR, which makes maxconn configurable at the time of deployment.

The change from 2000->20000 brings haproxy config in-line with how it's deployed in Red Hat's OpenStack product. It also is as I said, a safe default for us to improve OOTB experience of OCP for a large swath of customers (2000 is very, very easy to hit).

We posed our situation to the upstream list; discussed here: https://www.mail-archive.com/[email protected]/msg24553.html

I'll go on-record one last time about this. I think access to the healthz endpoint should not be counted against maxconn.

@pecameron pecameron deleted the bz1405440 branch February 7, 2017 13:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants