
EgressNetworkPolicy rewrite drops traffic for too long #19276

Closed
mjudeikis opened this issue Apr 9, 2018 · 16 comments

@mjudeikis
Contributor

mjudeikis commented Apr 9, 2018

I have multiple EgressNetworkPolicies in my cluster.

Policy:

{
    "kind": "EgressNetworkPolicy",
    "apiVersion": "v1",
    "metadata": {
        "name": "default"
    },
    "spec": {
        "egress": [
            {
                "type": "Allow",
                "to": {
                    "cidrSelector": "1.2.3.0/24"
                }
            },
            {
                "type": "Allow",
                "to": {
                    "dnsName": "www.foo.com"
                }
            },
            {
                "type": "Deny",
                "to": {
                    "cidrSelector": "0.0.0.0/0"
                }
            }
        ]
    }
}

namespace:

[root@console-REPL ~]# oc describe netnamespace labs-dev                                                 
Name:           labs-dev                            
Created:        About an hour ago                   
Labels:         <none>                              
Annotations:    <none>                              
Name:           labs-dev                            
ID:             1674326                             
Egress IPs:     <none>                              

So I have the rules created:

[root@node1 ~]# ovs-ofctl -O OpenFlow13 dump-flows br0 | grep 198c56                                                                                                                                               
 cookie=0x0, duration=3105.808s, table=60, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=172.30.26.73,nw_frag=later actions=load:0x198c56->NXM_NX_REG1[],load:0x2->NXM_NX_REG2[],goto_table:80                    
 cookie=0x0, duration=3105.801s, table=60, n_packets=0, n_bytes=0, priority=100,tcp,nw_dst=172.30.26.73,tp_dst=8080 actions=load:0x198c56->NXM_NX_REG1[],load:0x2->NXM_NX_REG2[],goto_table:80                     
 cookie=0x0, duration=3140.219s, table=80, n_packets=0, n_bytes=0, priority=100,reg0=0x198c56,reg1=0x198c56 actions=output:NXM_NX_REG2[]                                                                           
 cookie=0x0, duration=253.253s, table=101, n_packets=0, n_bytes=0, priority=3,ip,reg0=0x198c56,nw_dst=1.2.3.0/24 actions=output:2                                                                                  
 cookie=0x0, duration=253.247s, table=101, n_packets=0, n_bytes=0, priority=2,ip,reg0=0x198c56,nw_dst=107.23.81.155 actions=output:2                                                                               
 cookie=0x0, duration=253.242s, table=101, n_packets=0, n_bytes=0, priority=2,ip,reg0=0x198c56,nw_dst=107.23.205.115 actions=output:2                                                                              
 cookie=0x0, duration=253.233s, table=101, n_packets=0, n_bytes=0, priority=1,ip,reg0=0x198c56 actions=drop                                                                                                        
[root@node1 ~]#

All good and nice. But there is this piece of code:
https://github.com/openshift/origin/blob/master/pkg/network/node/ovscontroller.go#L426
which runs roughly every 30 minutes.

Now, my www.foo.com is behind an internal DNS server, and resolution is slower than usual.

Spoiler: from this point on I am making assumptions without knowing the code base.
This part of the code takes a long time to execute and hangs (?):
https://github.com/openshift/origin/blob/master/pkg/network/node/ovscontroller.go#L447-L487

DNS should be taken from the cache, but by any chance it could be blocked by this:
https://github.com/openshift/origin/blob/master/pkg/network/node/ovscontroller.go#L490

Long story short, the time between:
otx.AddFlow("table=101, reg0=%d, cookie=1, priority=65535, actions=drop", vnid)
and
otx.DeleteFlows("table=101, reg0=%d, cookie=1/1", vnid)
in the code causes downtime.
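
To make that window concrete, here is a rough Go sketch of the pattern as I understand it (not the actual ovscontroller.go code; ovsTransaction and egressRule are simplified stand-ins, and net.LookupIP stands in for the cached egressDNS.GetIPs): the temporary drop flow goes in first, the per-rule flows are regenerated one by one with any DNS resolution happening inline, and only then is the drop flow removed.

package main

import (
    "fmt"
    "net"
)

// Simplified stand-in for the real OVS flow transaction type.
type ovsTransaction struct{}

func (t *ovsTransaction) AddFlow(format string, args ...interface{}) {
    fmt.Printf("add-flow  "+format+"\n", args...)
}

func (t *ovsTransaction) DeleteFlows(format string, args ...interface{}) {
    fmt.Printf("del-flows "+format+"\n", args...)
}

type egressRule struct {
    policy       string // "Allow" or "Deny"
    cidrSelector string // set for CIDR-based rules
    dnsName      string // set for DNS-based rules
}

func updateEgressFlows(otx *ovsTransaction, vnid uint32, rules []egressRule) {
    // 1. Block all egress traffic for this VNID while the rules are rebuilt.
    otx.AddFlow("table=101, reg0=%d, cookie=1, priority=65535, actions=drop", vnid)
    // 2. Delete the old per-rule flows and regenerate them one by one.
    otx.DeleteFlows("table=101, reg0=%d, cookie=0/1", vnid)
    priority := len(rules)
    for _, rule := range rules {
        action := "output:2"
        if rule.policy == "Deny" {
            action = "drop"
        }
        if rule.dnsName != "" {
            // Slow part: in the real code this is egressDNS.GetIPs(), which can
            // block on a shared lock while the resync goroutine is re-resolving
            // names against a slow DNS server.
            ips, err := net.LookupIP(rule.dnsName)
            if err != nil {
                continue
            }
            for _, ip := range ips {
                otx.AddFlow("table=101, reg0=%d, priority=%d, ip, nw_dst=%s, actions=%s",
                    vnid, priority, ip, action)
            }
        } else {
            otx.AddFlow("table=101, reg0=%d, priority=%d, ip, nw_dst=%s, actions=%s",
                vnid, priority, rule.cidrSelector, action)
        }
        priority--
    }
    // 3. Only now is the temporary drop flow removed; all the time spent in
    //    step 2 is downtime for every pod in the namespace.
    otx.DeleteFlows("table=101, reg0=%d, cookie=1/1", vnid)
}

func main() {
    otx := &ovsTransaction{}
    updateEgressFlows(otx, 0x198c56, []egressRule{
        {policy: "Allow", cidrSelector: "1.2.3.0/24"},
        {policy: "Allow", dnsName: "www.foo.com"},
        {policy: "Deny", cidrSelector: "0.0.0.0/0"},
    })
}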

Still trying to debug on-site with the cluster; I will update as I find something, but any comments are welcome.

@mjudeikis
Contributor Author

ping @openshift/networking ?

@mjudeikis
Contributor Author

So this is a dump of one such case:

[root@ocpd00068 flows]# grep 0x7f33b6 ./flows.2018_03_09_21:54_52 | grep table=101
		cookie=0x1,  table=101,   priority=65535,reg0=0x7f33b6 actions=drop
cookie=0x0,  table=101,   priority=21,ip,reg0=0x7f33b6,nw_dst=10.254.4.130 actions=output:2
cookie=0x0,  table=101,   priority=20,ip,reg0=0x7f33b6,nw_dst=10.254.0.101 actions=output:2
cookie=0x0,  table=101,   priority=19,ip,reg0=0x7f33b6,nw_dst=10.0.27.1 actions=output:2
cookie=0x0,  table=101,   priority=18,ip,reg0=0x7f33b6,nw_dst=10.239.98.20 actions=output:2
cookie=0x0,  table=101,   priority=17,ip,reg0=0x7f33b6,nw_dst=10.0.17.2 actions=output:2
cookie=0x0,  table=101,   priority=16,ip,reg0=0x7f33b6,nw_dst=10.0.17.1 actions=output:2
cookie=0x0,  table=101,   priority=15,ip,reg0=0x7f33b6,nw_dst=10.254.0.100 actions=output:2
cookie=0x0,  table=101,   priority=14,ip,reg0=0x7f33b6,nw_dst=10.0.17.199 actions=output:2
cookie=0x0,  table=101,   priority=13,ip,reg0=0x7f33b6,nw_dst=10.254.20.100 actions=output:2
cookie=0x0,  table=101,   priority=12,ip,reg0=0x7f33b6,nw_dst=10.0.21.59 actions=output:2
cookie=0x0,  table=101,   priority=11,ip,reg0=0x7f33b6,nw_dst=10.253.0.28 actions=output:2
cookie=0x0,  table=101,   priority=10,ip,reg0=0x7f33b6,nw_dst=10.253.2.101 actions=output:2
cookie=0x0,  table=101,   priority=9,ip,reg0=0x7f33b6,nw_dst=10.253.2.122 actions=output:2
cookie=0x0,  table=101,   priority=8,ip,reg0=0x7f33b6,nw_dst=10.0.34.105 actions=output:2
cookie=0x0,  table=101,   priority=7,ip,reg0=0x7f33b6,nw_dst=10.239.98.125 actions=output:2
cookie=0x0,  table=101,   priority=6,ip,reg0=0x7f33b6,nw_dst=10.239.98.125 actions=output:2

[root@ocpd00068 flows]# grep 0x7f33b6 ./flows.2018_03_09_21:54_53 | grep table=101
		cookie=0x1,  table=101,   priority=65535,reg0=0x7f33b6 actions=drop
cookie=0x0,  table=101,   priority=21,ip,reg0=0x7f33b6,nw_dst=10.254.4.130 actions=output:2
cookie=0x0,  table=101,   priority=20,ip,reg0=0x7f33b6,nw_dst=10.254.0.101 actions=output:2
cookie=0x0,  table=101,   priority=19,ip,reg0=0x7f33b6,nw_dst=10.0.27.1 actions=output:2
cookie=0x0,  table=101,   priority=18,ip,reg0=0x7f33b6,nw_dst=10.239.98.20 actions=output:2
cookie=0x0,  table=101,   priority=17,ip,reg0=0x7f33b6,nw_dst=10.0.17.2 actions=output:2
cookie=0x0,  table=101,   priority=16,ip,reg0=0x7f33b6,nw_dst=10.0.17.1 actions=output:2
cookie=0x0,  table=101,   priority=15,ip,reg0=0x7f33b6,nw_dst=10.254.0.100 actions=output:2
cookie=0x0,  table=101,   priority=14,ip,reg0=0x7f33b6,nw_dst=10.0.17.199 actions=output:2
cookie=0x0,  table=101,   priority=13,ip,reg0=0x7f33b6,nw_dst=10.254.20.100 actions=output:2
cookie=0x0,  table=101,   priority=12,ip,reg0=0x7f33b6,nw_dst=10.0.21.59 actions=output:2
cookie=0x0,  table=101,   priority=11,ip,reg0=0x7f33b6,nw_dst=10.253.0.28 actions=output:2
cookie=0x0,  table=101,   priority=10,ip,reg0=0x7f33b6,nw_dst=10.253.2.101 actions=output:2
cookie=0x0,  table=101,   priority=9,ip,reg0=0x7f33b6,nw_dst=10.253.2.122 actions=output:2
cookie=0x0,  table=101,   priority=8,ip,reg0=0x7f33b6,nw_dst=10.0.34.105 actions=output:2
cookie=0x0,  table=101,   priority=7,ip,reg0=0x7f33b6,nw_dst=10.239.98.125 actions=output:2
cookie=0x0,  table=101,   priority=6,ip,reg0=0x7f33b6,nw_dst=10.239.98.125 actions=output:2
cookie=0x0,  table=101,   priority=5,ip,reg0=0x7f33b6,nw_dst=10.254.7.66 actions=output:2
cookie=0x0,  table=101,   priority=4,ip,reg0=0x7f33b6,nw_dst=10.254.4.66 actions=output:2
cookie=0x0,  table=101,   priority=3,ip,reg0=0x7f33b6,nw_dst=10.252.2.2 actions=output:2
cookie=0x0,  table=101,   priority=2,ip,reg0=0x7f33b6,nw_dst=10.254.0.83 actions=output:2
cookie=0x0,  table=101,   priority=1,ip,reg0=0x7f33b6 actions=drop

So the diff between the first file and the second file is 1 second, and we can see the rules being repopulated, as the second file has more records. Because the drop rule exists, traffic can't get out. There were cases when this table rebuild took up to 6 seconds.

@danwinship
Contributor

This part of the code takes a long time to execute and hangs (?):
https://github.com/openshift/origin/blob/master/pkg/network/node/ovscontroller.go#L447-L487

DNS should be taken from the cache, but by any chance it could be blocked by this:
https://github.com/openshift/origin/blob/master/pkg/network/node/ovscontroller.go#L490

490 is

if err := common.CheckDNSResolver(); err != nil {

which just checks if dig is installed. The delay is probably coming earlier, at:

ips := egressDNS.GetIPs(policies[0], rule.To.DNSName)

because that has to wait for a lock, so it might block on the resync in another thread. We need to rewrite this to generate the rules first and then write them out all at once after.
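
Roughly along these lines (a minimal sketch of that approach, not the actual patch; ovsTransaction and egressRule are the same kind of simplified stand-ins as in the sketch above, and net.LookupIP stands in for the cached egressDNS.GetIPs()): do all of the potentially slow work up front, so that only fast OVS writes happen between the temporary drop flow being added and removed.

package main

import (
    "fmt"
    "net"
)

// Same kind of simplified stand-ins as in the earlier sketch.
type ovsTransaction struct{}

func (t *ovsTransaction) AddFlow(format string, args ...interface{}) {
    fmt.Printf("add-flow  "+format+"\n", args...)
}

func (t *ovsTransaction) DeleteFlows(format string, args ...interface{}) {
    fmt.Printf("del-flows "+format+"\n", args...)
}

type egressRule struct {
    policy, cidrSelector, dnsName string
}

// generateFlows does all of the potentially slow work (DNS resolution and, in
// the real code, waiting on the shared egressDNS lock) before OVS is touched.
func generateFlows(vnid uint32, rules []egressRule) []string {
    var flows []string
    priority := len(rules)
    for _, rule := range rules {
        action := "output:2"
        if rule.policy == "Deny" {
            action = "drop"
        }
        dsts := []string{rule.cidrSelector}
        if rule.dnsName != "" {
            dsts = nil
            ips, err := net.LookupIP(rule.dnsName) // stand-in for egressDNS.GetIPs()
            if err == nil {
                for _, ip := range ips {
                    dsts = append(dsts, ip.String())
                }
            }
        }
        for _, dst := range dsts {
            flows = append(flows, fmt.Sprintf(
                "table=101, reg0=%d, priority=%d, ip, nw_dst=%s, actions=%s",
                vnid, priority, dst, action))
        }
        priority--
    }
    return flows
}

// applyFlows only performs fast OVS writes, so the window between adding and
// removing the temporary cookie=1 drop flow shrinks to almost nothing.
func applyFlows(otx *ovsTransaction, vnid uint32, flows []string) {
    otx.AddFlow("table=101, reg0=%d, cookie=1, priority=65535, actions=drop", vnid)
    otx.DeleteFlows("table=101, reg0=%d, cookie=0/1", vnid)
    for _, f := range flows {
        otx.AddFlow("%s", f)
    }
    otx.DeleteFlows("table=101, reg0=%d, cookie=1/1", vnid)
}

func main() {
    rules := []egressRule{
        {policy: "Allow", cidrSelector: "1.2.3.0/24"},
        {policy: "Allow", dnsName: "www.foo.com"},
        {policy: "Deny", cidrSelector: "0.0.0.0/0"},
    }
    vnid := uint32(0x198c56)
    applyFlows(&ovsTransaction{}, vnid, generateFlows(vnid, rules))
}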

@mjudeikis
Contributor Author

@danwinship what is the probability this will happen soon? If it's low, I might just do it myself.

@danwinship
Contributor

I'll probably get to it soonish

@mojsha

mojsha commented Apr 12, 2018

@danwinship We use the ovs-multitenant plugin and have an issue when using EgressNetworkPolicy objects. It happens "randomly", but basically any pod that has been restarted after the incident (I'm calling the point at which the bug starts taking effect the "incident"), or any newly provisioned pod, is unreachable by other pods. Eventually the entire cluster becomes unusable and we have to restart the nodes for everything to work again. After removing all EgressNetworkPolicy objects, the problem goes away.

Are we understanding it correctly that you drop all traffic to get a snapshot/dump of the OVS rules, then do some manipulation and write down the changes, and then allow traffic to start again? And could this be a reason for the above?

@mjudeikis
Contributor Author

@mojsha do you use your upstream organization DNS for this?
Are you using NodePort?
Do you use full FQDNs in your EgressNetworkPolicy files, or just IPs?

@mojsha

mojsha commented Apr 12, 2018

@mjudeikis We use OpenShift's SkyDNS in combination with an upstream DNS for non-cluster addresses.
We use a mix of ClusterIP and NodePort. Does it make any difference what type it is?
We use a combination of DNS names (FQDN and non-FQDN), IPs, and CIDR addresses in the EgressNetworkPolicy objects.

@danwinship
Contributor

@mojsha that's not this bug. This bug only results in temporary disruptions.

Your bug sounds like #13965, which was fixed in 3.7. Are you running a really old version of origin?

@mojsha

mojsha commented Apr 12, 2018

@danwinship No, I'm using 3.6, and we are moving to 3.7.2. What makes you think that this issue was fixed in 3.7 (it's not apparent when I'm looking at the issue)?

Can you confirm that the traffic is temporarily "blocked"? If so, is there a workaround for this? I am thinking of our production setup, where we will be using this and we don't want any service interruption.

@danwinship
Contributor

I know it's fixed in 3.7 from when it was committed.

There's no workaround for the bug discussed here, but the fix will probably get backported to 3.7. (OCP 3.7 that is, not Origin. But you shouldn't be running old versions of Origin.)

@mojsha

mojsha commented Apr 13, 2018

Fair enough. We're in the process of moving to OCP.

Can you clarify my second paragraph: will an update of an EgressNetworkPolicy object result in network connectivity being dropped/stopped for a short duration, and will it lead to a service interruption if there is an ongoing connection/session or a new incoming one?

@mjudeikis
Contributor Author

@danwinship bz:1558484

@danwinship
Contributor

@mojsha most users have not noticed any disruption. It may be that the disruption only occurs if, e.g., you have a DNS-based rule and the hostname it refers to has slow/flaky DNS servers.

Traffic is silently dropped during the interruption; it is not actively rejected. Assuming the interruption is brief, then the packets will get retried and you won't even notice.

@mjudeikis
Contributor Author

@mojsha I can confirm that the cluster where I observed this contains hundreds of egress rules, and they use FQDNs, and the DNS is not the fastest. It's not perfect, but the impact is minimal if you don't overuse egress rules. So know the limitations, but overall it's not critical.

If we move to fully IP-based rules, there is no issue at all.

@mojsha

mojsha commented Apr 13, 2018

@danwinship @mjudeikis Thanks guys. Makes me much more comfortable re-enabling the usage of EgressNetworkPolicy objects once we complete the upgrade to 3.7.
