-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sdn: set veth TX queue length to unblock QoS #11126
Conversation
docker/libnetwork appear to create the pod veth interfaces with an explicitly 0 txqlen instead of leaving the default 1000 that the kernel would assign. A txqlen of 0 blocks any packets when QoS is enabled for a port due to the way the Linux HTB shaper works. Explicitly set the veth txqlen to make QoS work again. Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1378000
@openshift/networking @eparis |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but should we open an issue with docker / libnetwork too?
See https://bugzilla.redhat.com/show_bug.cgi?id=1152231 for background on why veth interfaces now work this way... |
Its pretty murky; the explicit txqlen:0 was done back in 2014 to work around the bug that the kernel change from 1152231 fixes. So maybe it's time to revisit. But since we may not use docker for creating the veth in the future anyway, it probably doesn't matter much to us. |
OK... so... With old kernels, veths were created by default in such a way that they had bad performance (docker-archive/libcontainer#193). Setting txqueuelen to 0 fixes the performance problem but would cause explicitly-added qdiscs to fail (docker-archive/libcontainer#193 (comment)). Docker went with that approach anyway though, because they decided qdiscs were an advanced feature they didn't need to support (docker-archive/libcontainer#193 (comment)). Except meanwhile, it turns out that the original theory that setting txqueuelen=0 would cause explicitly-added qdiscs to fail was mostly wrong, and in particular doesn't apply to the "htb" qdisc which we are using (http://www.spinics.net/lists/netdev/msg337586.html) (although it defaults to txqueuelen=2 in this case, not 1000, which seems like it might be... poor?) So with old kernels, OpenShift worked, even when using QoS. Phil and Jesper eventually fixed the original "veths were created by default in such a way that they had bad performance" bug (https://bugzilla.redhat.com/show_bug.cgi?id=1152231), which means that the original libcontainer "set txqueuelen to 0" workaround is no longer needed (yay!), but in the process they also removed the workaround for txqueuelen==0 in htb (https://www.mail-archive.com/[email protected]/msg74390.html), which means that the "set txqueuelen to 0" workaround is no longer allowed (boo!). (This seems like it might count as the sort of thing that kernel hackers consider an API break? Or maybe not, depending on exactly how poorly the txqueuelen of 2 worked...) Anyway... with this patch:
So LGTM. [merge] |
[Test]ing while waiting on the merge queue |
Evaluated for origin test up to 8a2fbcf |
continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/9384/) |
Travis failure is #11015 so re-[merge] |
LGTM, @danwinship thx for the nice explanation |
[merge] |
flake is #11165 re-[merge] |
Evaluated for origin merge up to 8a2fbcf |
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/9526/) (Image: devenv-rhel7_5102) |
It is a clear misconfiguration to attach a qdisc to a device with tx_queue_len zero, because some qdisc's (namely, pfifo, bfifo, gred, htb, plug and sfb) inherit/copy this value as their queue length. Why should the kernel catch such a misconfiguration? Because prior to introducing the IFF_NO_QUEUE device flag, userspace found a loophole in the qdisc config system that allowed them to achieve the equivalent of IFF_NO_QUEUE, which is to remove the qdisc code path entirely from a device. The loophole on older kernels is setting tx_queue_len=0, *prior* to device qdisc init (the config time is significant, simply setting tx_queue_len=0 doesn't trigger the loophole). This loophole is currently used by Docker[1] to get better performance and scalability out of the veth device. The Docker developers were warned[1] that they needed to adjust the tx_queue_len if ever attaching a qdisc. The OpenShift project didn't remember this warning and attached a qdisc, this were caught and fixed in[2]. [1] docker-archive/libcontainer#193 [2] openshift/origin#11126 Instead of fixing every userspace program that used this loophole, and forgot to reset the tx_queue_len, prior to attaching a qdisc. Let's catch the misconfiguration on the kernel side. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
It is a clear misconfiguration to attach a qdisc to a device with tx_queue_len zero, because some qdisc's (namely, pfifo, bfifo, gred, htb, plug and sfb) inherit/copy this value as their queue length. Why should the kernel catch such a misconfiguration? Because prior to introducing the IFF_NO_QUEUE device flag, userspace found a loophole in the qdisc config system that allowed them to achieve the equivalent of IFF_NO_QUEUE, which is to remove the qdisc code path entirely from a device. The loophole on older kernels is setting tx_queue_len=0, *prior* to device qdisc init (the config time is significant, simply setting tx_queue_len=0 doesn't trigger the loophole). This loophole is currently used by Docker[1] to get better performance and scalability out of the veth device. The Docker developers were warned[1] that they needed to adjust the tx_queue_len if ever attaching a qdisc. The OpenShift project didn't remember this warning and attached a qdisc, this were caught and fixed in[2]. [1] docker-archive/libcontainer#193 [2] openshift/origin#11126 Instead of fixing every userspace program that used this loophole, and forgot to reset the tx_queue_len, prior to attaching a qdisc. Let's catch the misconfiguration on the kernel side. Signed-off-by: Jesper Dangaard Brouer <[email protected]> Signed-off-by: David S. Miller <[email protected]>
It is a clear misconfiguration to attach a qdisc to a device with tx_queue_len zero, because some qdisc's (namely, pfifo, bfifo, gred, htb, plug and sfb) inherit/copy this value as their queue length. Why should the kernel catch such a misconfiguration? Because prior to introducing the IFF_NO_QUEUE device flag, userspace found a loophole in the qdisc config system that allowed them to achieve the equivalent of IFF_NO_QUEUE, which is to remove the qdisc code path entirely from a device. The loophole on older kernels is setting tx_queue_len=0, *prior* to device qdisc init (the config time is significant, simply setting tx_queue_len=0 doesn't trigger the loophole). This loophole is currently used by Docker[1] to get better performance and scalability out of the veth device. The Docker developers were warned[1] that they needed to adjust the tx_queue_len if ever attaching a qdisc. The OpenShift project didn't remember this warning and attached a qdisc, this were caught and fixed in[2]. [1] docker-archive/libcontainer#193 [2] openshift/origin#11126 Instead of fixing every userspace program that used this loophole, and forgot to reset the tx_queue_len, prior to attaching a qdisc. Let's catch the misconfiguration on the kernel side. Signed-off-by: Jesper Dangaard Brouer <[email protected]> Signed-off-by: David S. Miller <[email protected]>
It is a clear misconfiguration to attach a qdisc to a device with tx_queue_len zero, because some qdisc's (namely, pfifo, bfifo, gred, htb, plug and sfb) inherit/copy this value as their queue length. Why should the kernel catch such a misconfiguration? Because prior to introducing the IFF_NO_QUEUE device flag, userspace found a loophole in the qdisc config system that allowed them to achieve the equivalent of IFF_NO_QUEUE, which is to remove the qdisc code path entirely from a device. The loophole on older kernels is setting tx_queue_len=0, *prior* to device qdisc init (the config time is significant, simply setting tx_queue_len=0 doesn't trigger the loophole). This loophole is currently used by Docker[1] to get better performance and scalability out of the veth device. The Docker developers were warned[1] that they needed to adjust the tx_queue_len if ever attaching a qdisc. The OpenShift project didn't remember this warning and attached a qdisc, this were caught and fixed in[2]. [1] docker-archive/libcontainer#193 [2] openshift/origin#11126 Instead of fixing every userspace program that used this loophole, and forgot to reset the tx_queue_len, prior to attaching a qdisc. Let's catch the misconfiguration on the kernel side. Signed-off-by: Jesper Dangaard Brouer <[email protected]> Signed-off-by: David S. Miller <[email protected]>
It is a clear misconfiguration to attach a qdisc to a device with tx_queue_len zero, because some qdisc's (namely, pfifo, bfifo, gred, htb, plug and sfb) inherit/copy this value as their queue length. Why should the kernel catch such a misconfiguration? Because prior to introducing the IFF_NO_QUEUE device flag, userspace found a loophole in the qdisc config system that allowed them to achieve the equivalent of IFF_NO_QUEUE, which is to remove the qdisc code path entirely from a device. The loophole on older kernels is setting tx_queue_len=0, *prior* to device qdisc init (the config time is significant, simply setting tx_queue_len=0 doesn't trigger the loophole). This loophole is currently used by Docker[1] to get better performance and scalability out of the veth device. The Docker developers were warned[1] that they needed to adjust the tx_queue_len if ever attaching a qdisc. The OpenShift project didn't remember this warning and attached a qdisc, this were caught and fixed in[2]. [1] docker-archive/libcontainer#193 [2] openshift/origin#11126 Instead of fixing every userspace program that used this loophole, and forgot to reset the tx_queue_len, prior to attaching a qdisc. Let's catch the misconfiguration on the kernel side. Signed-off-by: Jesper Dangaard Brouer <[email protected]> Signed-off-by: David S. Miller <[email protected]>
docker/libnetwork appear to create the pod veth interfaces with
an explicitly 0 txqlen instead of leaving the default 1000 that
the kernel would assign. A txqlen of 0 blocks any packets when
QoS is enabled for a port due to the way the Linux HTB shaper
works. Explicitly set the veth txqlen to make QoS work again.
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1378000