-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make haproxy maxconn configurable #12716
Conversation
@knobunc PTAL |
[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
[test] |
awesome, thank you very much for this. /cc @jmencak |
@pecameron @knobunc do you mean why this failed?
|
@ncdc Yes that is the issue. It has happened twice so far. |
The origin container log indicates the pod wasn't healthy. I'd recommend you dig in there to see if you can find out why. |
@ndnc @knobunc @ramr The pod is the haproxy router. I have made a change in the template and resulting config file (added maxconn 20000). The changes work on my lab machines but not here. Do you know if port 1936 is open in iptables (healthz uses it)? Also, do you know where the router is set up? Is there a dump of the environment variables in one of the log files? We have looked at the usual suspects and can't get to the bottom of this. |
Spin up a devenv in EC2 and test it there? You should be able to do
whatever sort of debugging you want in that env.
…On Tue, Jan 31, 2017 at 2:37 PM, Phil Cameron ***@***.***> wrote:
@ndnc @knobunc <https://github.com/knobunc> @ramr
<https://github.com/ramr> The pod is the haproxy router. I have made a
change in the template and resulting config file (added maxconn 20000). The
changes work on my lab machines but not here. Do you know if port 1936 is
open in iptables (healthz uses it)? Also, do you know where the router is
set up? Is there a dump of the environment variables in one of the log
files? We have looked at the usual suspects and can't get to the bottom of
this.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#12716 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAABYqSyaKfdY-Ku7ci__qJtwa_Ekgsfks5rX41ygaJpZM4LxxDr>
.
|
[test] |
@ramr @knobunc @rajatchopra I need some help with this. I added maxconn 20000 to the haproxy config file in the global and default sections (where the commented maxconn 4096 was coded) and the tests fail the same way in the same spot. When I remove the maxcon 20000 the tests pass. I can't find out what is causing the failure. Do any of you have any idea what is happening? I am trying to get this reproduced on a EC2 cluster but no luck yet. This works on my lab cluster so I think it has something to do with EC2. Thanks! |
@pecameron Likely a sysctl limit. |
@rajatchopra Thanks. @ncdc how can we find the limits for 'sysctl fs.nr_open' and 'sysctl fs.file-max' on the test system? Is it in one of the log files? |
5507337
to
bdb9aa5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
2dca65c
to
73ea351
Compare
@openshift/cli-review - new ipfailover option --max-connections PTAL |
[test] |
@@ -6,7 +6,8 @@ | |||
{{ define "/var/lib/haproxy/conf/haproxy.config" }} | |||
{{ $workingDir := .WorkingDir }} | |||
global | |||
# maxconn 4096 | |||
maxconn {{env "ROUTER_MAX_CONNECTIONS" "20000"}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will break existing routers... please make the default 2k as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would it break existing routers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, while supporting this is ok for now, I don't think this is a value that should be hardcoded in the long run. The router ultimately needs to make a determination about how many connections to allow based on other factors. Arbitrary constants that don't work for every cluster are a recipe for a break at the worst time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the time I made the comment, the theory about why the tests were breaking was that the AWS file handle limits were not allowing 200k. We had dropped everything to 2k in oadm, so this was defaulting the existing routers to a higher number.
We found that the AWS limit was huge and not causing the problem (rather it was a default coming from 'oc cluster up' that was not set, so setting maxconns to 0). So this comment is stale.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@smarterclayton a maxconns=auto type thing would be an RFE against haproxy. The idea makes sense to me, I wonder if it's been discussed upstream before.
Edit:
From the docs: http://cbonte.github.io/haproxy-dconv/1.7/configuration.html#maxconn
This is new in 1.7:
snip:
If this value is not set, it will default to the value
set in DEFAULT_MAXCONN at build time (reported in haproxy -vv) if no memory
limit is enforced, or will be computed based on the memory limit, the buffer
size, memory allocated to compression, SSL cache size, and use or not of SSL
and the associated maxsslconn (which can also be automatic).```
@@ -33,7 +34,8 @@ global | |||
# ssl-default-bind-ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:ECDHE-RSA-DES-CBC3-SHA:ECDHE-ECDSA-DES-CBC3-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:AES:DES-CBC3-SHA:HIGH:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!aECDH:!EDH-DSS-DES-CBC3-SHA:!EDH-RSA-DES-CBC3-SHA:!KRB5-DES-CBC3-SHA | |||
|
|||
defaults | |||
# maxconn 4096 | |||
maxconn {{env "ROUTER_MAX_CONNECTIONS" "20000"}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same
@openshift/cli-review PTAL |
@knobunc fix worked. Anything else needed before merge? |
@@ -157,6 +157,7 @@ func (h *Helper) InstallRouter(kubeClient kclientset.Interface, f *clientcmd.Fac | |||
HostNetwork: !portForwarding, | |||
HostPorts: true, | |||
ServiceAccount: "router", | |||
MaxConnections: 2000, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this more last night and I think this is the wrong approach. I think we make a maxconns of 0 mean 'use the default'. It makes no sense for 0 to be set ever. So let's not change this file, but handle it down below in router.go.
pkg/cmd/admin/router/router.go
Outdated
@@ -232,6 +236,9 @@ const ( | |||
// Default stats and healthz port. | |||
defaultStatsPort = 1936 | |||
defaultHealthzPort = defaultStatsPort | |||
|
|||
// Default maximum number of connections | |||
defaultMaxConnections = 2000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And let's go back to 20,000 since the problem was not the 2000
@knobunc OK changing default to 20000. --max-connections=0 changed to default (20000) |
@@ -6,7 +6,8 @@ | |||
{{ define "/var/lib/haproxy/conf/haproxy.config" }} | |||
{{ $workingDir := .WorkingDir }} | |||
global | |||
# maxconn 4096 | |||
maxconn {{env "ROUTER_MAX_CONNECTIONS" "2000"}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
20000 please
@@ -33,7 +34,8 @@ global | |||
# ssl-default-bind-ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:ECDHE-RSA-DES-CBC3-SHA:ECDHE-ECDSA-DES-CBC3-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:AES:DES-CBC3-SHA:HIGH:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!aECDH:!EDH-DSS-DES-CBC3-SHA:!EDH-RSA-DES-CBC3-SHA:!KRB5-DES-CBC3-SHA | |||
|
|||
defaults | |||
# maxconn 4096 | |||
maxconn {{env "ROUTER_MAX_CONNECTIONS" "2000"}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
@knobunc fixed |
pkg/cmd/admin/router/router.go
Outdated
@@ -250,6 +257,8 @@ func NewCmdRouter(f *clientcmd.Factory, parentName, name string, out, errout io. | |||
StatsPort: defaultStatsPort, | |||
HostNetwork: true, | |||
HostPorts: true, | |||
|
|||
MaxConnections: defaultMaxConnections, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't set a default on the client - the router should set that, not here. The default value here should be ""
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, client code that has server side components should never impose defaults - the server should do that. Same as with validation. Client side API logic is not appropriate.
@smarterclayton changes are in PTAL |
pkg/cmd/infra/router/template.go
Outdated
@@ -91,6 +92,7 @@ func (o *TemplateRouter) Bind(flag *pflag.FlagSet) { | |||
flag.DurationVar(&o.ReloadInterval, "interval", reloadInterval(), "Controls how often router reloads are invoked. Mutiple router reload requests are coalesced for the duration of this interval since the last reload time.") | |||
flag.BoolVar(&o.ExtendedValidation, "extended-validation", util.Env("EXTENDED_VALIDATION", "true") == "true", "If set, then an additional extended validation step is performed on all routes admitted in by this router. Defaults to true and enables the extended validation checks.") | |||
flag.BoolVar(&o.BindPortsAfterSync, "bind-ports-after-sync", util.Env("ROUTER_BIND_PORTS_AFTER_SYNC", "") == "true", "Bind ports only after route state has been synchronized") | |||
flag.StringVar(&o.MaxConnections, "max-connections", util.Env("ROUTER_MAX_CONNECTIONS", ""), "Specifies the maximum number of concurrent connections. Default 20000") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think Default 20000
is appropriate - that's only the default for newer images that we provide.
@smarterclayton @knobunc this has in successive commits been 2000, then 20000, then 2000, then 20000 (literally!) |
On 02/03/2017 05:23 PM, Clayton Coleman wrote:
***@***.**** commented on this pull request.
------------------------------------------------------------------------
In images/router/haproxy/conf/haproxy-config.template
<#12716>:
> @@ -6,7 +6,8 @@
{{ define "/var/lib/haproxy/conf/haproxy.config" }}
{{ $workingDir := .WorkingDir }}
global
- # maxconn 4096
+ maxconn {{env "ROUTER_MAX_CONNECTIONS" "20000"}}
Why would it break existing routers?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#12716>, or mute the thread
<https://github.com/notifications/unsubscribe-auth/ANUgekIMG9rN7u-CBkmJjRmdGV88H3w_ks5rY6jigaJpZM4LxxDr>.
Not sure what you mean here. This just allows for more connections. It
doesn't break anything. The default of 2000 is really too low for todays
servers.
|
On 02/03/2017 05:24 PM, Clayton Coleman wrote:
***@***.**** commented on this pull request.
------------------------------------------------------------------------
In images/router/haproxy/conf/haproxy-config.template
<#12716>:
> @@ -6,7 +6,8 @@
{{ define "/var/lib/haproxy/conf/haproxy.config" }}
{{ $workingDir := .WorkingDir }}
global
- # maxconn 4096
+ maxconn {{env "ROUTER_MAX_CONNECTIONS" "20000"}}
Also, while supporting this is ok for now, I don't think this is a
value that should be hardcoded in the long run. The router ultimately
needs to make a determination about how many connections to allow
based on other factors. Arbitrary constants that don't work for
/every/ cluster are a recipe for a break at the worst time.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#12716>, or mute the thread
<https://github.com/notifications/unsubscribe-auth/ANUgevqln6rEk8Wyrfaocy2FRvnLWMyoks5rY6kZgaJpZM4LxxDr>.
The value of 2000 is hard coded in haproxy. It is really too low.
|
The haproxy maxconn (maximum connections) is by default 20000. This was previously 2000. This change makes that configurable through the oadm router --max-connections= option when creating a router. For existing routers the value can be set in the ROUTER_MAX_CONNECTIONS environment variable. If ROUTER_MAX_CONNECTIONS is missing, the deafult (20000) is used. openshift-docs PR 3609 bug 1405440 https://bugzilla.redhat.com/show_bug.cgi?id=1405440
Evaluated for origin test up to 119f9b1 |
@smarterclayton would you like to take another pass? Now 'oadm router' only sets the maxconns environment variable if passed. The template default (when the env is not set) is 20,000. |
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/13616/) (Base Commit: 6f30f7e) |
[merge] |
Evaluated for origin merge up to 119f9b1 |
changes LGTM |
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/13641/) (Base Commit: c955745) (Image: devenv-rhel7_5865) |
My opinion, it is good to make haproxy's maxconn configurable. |
@louyihua agreed that this is only one approach to the actual fix. We did propose a different method of probing using scripts and connecting to the haproxy socket, but I think @smarterclayton thought this was fragile? We have it working, and it does solve the situation. Anyway it looks like in newer versions of haproxy there is a way to make maxconn dynamic. That needs investigation (I just dug it up yesterday), and seems like the right way to go. Plus, as of 3.4 we have an easy way for users to overwrite the haproxy config using a configmap. And then you also have this PR, which makes maxconn configurable at the time of deployment. The change from 2000->20000 brings haproxy config in-line with how it's deployed in Red Hat's OpenStack product. It also is as I said, a safe default for us to improve OOTB experience of OCP for a large swath of customers (2000 is very, very easy to hit). We posed our situation to the upstream list; discussed here: https://www.mail-archive.com/[email protected]/msg24553.html I'll go on-record one last time about this. I think access to the healthz endpoint should not be counted against maxconn. |
The haproxy max_conn (maximum connnections) is by default 2000. This
change make that configurable through the
oadm router --max-connections= option when creating a router and through
changing the ROUTER_MAX_CONNECTIONS environment variable in an existing
router.
bug 1405440
https://bugzilla.redhat.com/show_bug.cgi?id=1405440