EgressNetworkPolicy rewrite drops traffic for too long #19276
ping @openshift/networking ?
So this is a dump for one of these cases.
The diff between the traffic in the first file and the second file is 1 second, and we can see the rules being populated, as the second file has more "records". Because the drop rule exists we can't get out. There were cases when this "table rebuild" took up to 6 seconds.
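(For anyone trying to reproduce this: the dumps above were presumably captured with something along these lines; br0 and table 101 are where openshift-sdn keeps the per-VNID egress rules, and the bridge speaks OpenFlow13.)

```sh
# Capture the egress-policy table twice, one second apart, then diff.
# (Illustrative; adjust bridge/table to your environment.)
ovs-ofctl -O OpenFlow13 dump-flows br0 table=101 > flows-1.txt
sleep 1
ovs-ofctl -O OpenFlow13 dump-flows br0 table=101 > flows-2.txt
diff flows-1.txt flows-2.txt
```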
490 is ... which just checks if ... That has to wait for a lock, so it might block on the resync in another thread. We need to rewrite this to generate the rules first and then write them out all at once afterward.
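A minimal sketch of that restructuring, assuming hypothetical names (flowRule, resolveDNS, and the transaction type are stand-ins, not the real openshift-sdn API): do all the slow work first, then swap the flows in one quick transaction.

```go
package main

import "fmt"

// Hypothetical stand-ins for the real types; only the two-phase
// shape proposed above matters here.
type flowRule struct {
	priority int
	selector string // CIDR, possibly resolved from a DNS name
	action   string // "output:2" (allow) or "drop"
}

// resolveDNS stands in for the potentially slow lookup that the real
// code currently performs while egress traffic is already being dropped.
func resolveDNS(name string) []string {
	return []string{"203.0.113.10/32"} // placeholder result
}

type transaction struct{ flows []string }

func (t *transaction) AddFlow(format string, args ...interface{}) {
	t.flows = append(t.flows, fmt.Sprintf(format, args...))
}
func (t *transaction) DeleteFlows(format string, args ...interface{}) {}
func (t *transaction) Commit() error                                  { return nil }

// syncEgressPolicy: phase 1 generates every rule up front (DNS may be
// slow, but nothing is dropped while we wait); phase 2 replaces the
// old flows all at once, keeping the no-rules window as short as possible.
func syncEgressPolicy(vnid uint32, dnsNames []string) error {
	// Phase 1: slow work (DNS lookups, lock acquisition) happens here,
	// before OVS is touched at all.
	var rules []flowRule
	prio := 1000
	for _, name := range dnsNames {
		for _, cidr := range resolveDNS(name) {
			rules = append(rules, flowRule{prio, cidr, "output:2"})
			prio--
		}
	}

	// Phase 2: one quick transaction swaps the old flows for the new.
	otx := &transaction{}
	otx.DeleteFlows("table=101, reg0=%d", vnid)
	for _, r := range rules {
		otx.AddFlow("table=101, reg0=%d, priority=%d, ip, nw_dst=%s, actions=%s",
			vnid, r.priority, r.selector, r.action)
	}
	return otx.Commit()
}

func main() {
	if err := syncEgressPolicy(0xa5, []string{"www.foo.com"}); err != nil {
		fmt.Println(err)
	}
}
```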
@danwinship how probable is it that this will happen soon? If it's low, I might just do it myself.
I'll probably get to it soonish
@danwinship We use the ovs-multitenant plugin and have an issue when using EgressNetworkPolicy objects. It happens "randomly", but basically any pod that has been restarted after the incident (I'm calling the point at which the bug starts taking effect the "incident"), or any newly provisioned pod, is basically unreachable by other pods. Eventually the entire cluster becomes unusable and we have to restart the nodes for everything to work again. After removing all EgressNetworkPolicy objects, the problem goes away. Are we understanding it correctly that you drop all traffic to get a snapshot/dump of the OVS rules, then do some manipulation and write the changes back, and then allow traffic again? And could this be the cause of the above?
@mojsha do you use your organization's upstream DNS for this?
@mjudeikis We use OpenShift's SkyDNS in combination with an upstream DNS for non-cluster addresses.
@danwinship No, I'm using 3.6, and we are moving to 3.7.2. What makes you think that this issue was fixed in 3.7 (it's not apparent when I'm looking at the issue)? Can you confirm that the traffic is temporarily "blocked"? If so, is there a workaround for this? I am thinking of our production setup where we will be using this, and we don't want any service interruption.
I know it's fixed in 3.7 from when it was committed. There's no workaround for the bug discussed here, but the fix will probably get backported to 3.7. (OCP 3.7 that is, not Origin. But you shouldn't be running old versions of Origin.)
Fair enough. We're in the process of moving to OCP. Can you clarify my second paragraph: will an update of an EgressNetworkPolicy object result in network connectivity being dropped/stopped for a short duration, and will it lead to service interruption if there is an ongoing connection/session or a new incoming one?
@danwinship bz:1558484
@mojsha most users have not noticed any disruption. It may be that the disruption only occurs if, e.g., you have a DNS-based rule and the hostname it refers to has slow/flaky DNS servers. Traffic is silently dropped during the interruption; it is not actively rejected. Assuming the interruption is brief, the packets will get retried and you won't even notice.
@mojsha I can confirm that the cluster where I observed this contains hundreds of egress rules, and they use FQDNs, where DNS is not the fastest. It's not perfect, but the impact is minimal if you don't overuse egress rules. So know the limitations, but overall it's not critical. If we move to fully IP-based rules, there is no issue at all.
@danwinship @mjudeikis Thanks guys. Makes me much more comfortable re-enabling the usage of EgressNetworkPolicy objects once we complete the upgrade to 3.7.
I have multiple EgressNetworkPolicies in my cluster.
Policy:
namespace:
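For illustration only, a minimal EgressNetworkPolicy of the sort being described might look like this (the namespace, CIDR, and API group are hypothetical placeholders; www.foo.com is the host referenced below):

```yaml
apiVersion: network.openshift.io/v1   # or "v1" on older releases; check your cluster
kind: EgressNetworkPolicy
metadata:
  name: default
  namespace: my-namespace             # hypothetical
spec:
  egress:
  - type: Allow
    to:
      dnsName: www.foo.com            # DNS-based rule, resolved by the node
  - type: Allow
    to:
      cidrSelector: 203.0.113.0/24    # hypothetical CIDR-based rule
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/0         # deny everything else
```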
So I have rules created:
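The resulting flows in table 101 would then look roughly like this (the VNID, IPs, and exact field formatting are hypothetical; my understanding is that allow rules go out via output:2, i.e. tun0, and reg0 carries the namespace's VNID):

```
table=101, reg0=0xa5, priority=3, ip, nw_dst=203.0.113.10   actions=output:2
table=101, reg0=0xa5, priority=2, ip, nw_dst=203.0.113.0/24 actions=output:2
table=101, reg0=0xa5, priority=1, ip                        actions=drop
```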
All good and nice. But there is this piece of code:
https://github.com/openshift/origin/blob/master/pkg/network/node/ovscontroller.go#L426
which runs every ~30 minutes.
Now my www.foo.com is behind an internal DNS server and is slower than usual. (Spoiler: from this point on I make some assumptions without knowing the code base.)
This part of the code takes a long time to execute and hangs (?):
https://github.com/openshift/origin/blob/master/pkg/network/node/ovscontroller.go#L447-L487
DNS should be served from the cache, but there is a chance it can be blocked by this:
https://github.com/openshift/origin/blob/master/pkg/network/node/ovscontroller.go#L490
Long story short, the time between:
otx.AddFlow("table=101, reg0=%d, cookie=1, priority=65535, actions=drop", vnid)
and
otx.DeleteFlows("table=101, reg0=%d, cookie=1/1", vnid)
in the code causes downtime.
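To make that window concrete, here is my reading of the order of operations (the two flow strings are copied from above; the steps in between are assumptions about the linked code, not a verified trace):

```go
otx.AddFlow("table=101, reg0=%d, cookie=1, priority=65535, actions=drop", vnid)
// -> from here on, ALL egress traffic for this VNID is dropped.

// ... the per-rule flows are regenerated; DNS-based rules resolve their
// hostnames here (and may wait on a lock held by the resync in another
// thread), so slow internal DNS stretches this phase out -- up to ~6
// seconds was observed above ...

otx.DeleteFlows("table=101, reg0=%d, cookie=1/1", vnid)
// -> the blanket drop is removed; traffic flows again.
```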
Still trying to debug on-site with the cluster; will update as I find something. Any comments are welcome.