After running oc cluster up and creating a pod/service, the newly created pod can't connect outbound to its own service IP:port #12111
@knobunc do you have any advice on who could help with this?
@jim-minter can you please grab a debug tar using the script referenced at https://docs.openshift.com/enterprise/3.2/admin_guide/sdn_troubleshooting.html#further-help. @danwinship can you think of anything funny that would happen in this case?
Please see https://dl.dropboxusercontent.com/u/13429758/openshift-sdn-debug-2016-12-02.tgz . Hopefully it'll be useful - I don't know if openshift-sdn-debug is entirely geared up to run with oc cluster up installs.
Hmm, I had thought this affected both "oc cluster up" and "openshift start" type one-box setups, but it appears to be related specifically to "oc cluster up".
From a networking point of view, the main difference is that one is running in a container (with --net=host and --privileged) and the other one is not.
hm... "oc cluster up" doesn't work for me on F23 (looks like #9470). Anyway, the issue is probably that we aren't setting --hairpin-mode to the right value, but I don't know how to override that when using "oc cluster up"...
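For context, --hairpin-mode is a kubelet flag rather than something oc cluster up exposes; a minimal sketch of what overriding it looks like on a standalone kubelet, assuming the values available at the time (promiscuous-bridge, hairpin-veth, none):
$ kubelet --help 2>&1 | grep hairpin-mode      # confirm the flag and its default in this kubelet build
$ kubelet --hairpin-mode=hairpin-veth ...       # illustrative only; oc cluster up drives the kubelet internally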
@danwinship is that a kubelet argument?
yes
@danwinship I didn't realise how close I was with my comment about setting ifconfig docker0 promisc :-)
from the server logs:
Previous comment is a red herring, I think. When running via oc cluster up, cat /sys/devices/virtual/net/docker0/brif/veth*/hairpin_mode shows all 0s. When running via openshift start, all 1s.
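For anyone following along, that check reads the per-port hairpin flag the bridge exposes in sysfs; a quick way to print each veth next to its value (0 in the broken case, 1 in the working case):
$ grep -H . /sys/devices/virtual/net/docker0/brif/veth*/hairpin_mode
/sys/devices/virtual/net/docker0/brif/veth59a9a17/hairpin_mode:0    # example output from the broken oc cluster up case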
hairpin-veth sounds right... maybe dockermanager just isn't able to set it (doesn't have access to /sys/devices/virtual/net maybe?). Are there any other hairpin-related messages in the logs?
Aha!
5909 openat(AT_FDCWD, "/sys/devices/virtual/net/veth59a9a17/brport/hairpin_mode", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0644) = -1 EROFS (Read-only file system)
Will get a PR for that.
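That EROFS points at /sys being mounted read-only inside the origin container, so the kubelet's write is rejected; a quick check, assuming the container started by oc cluster up is named origin:
$ docker exec origin sh -c 'mount | grep sysfs'    # expect "ro" flags on /sys in the failing case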
@danwinship @knobunc thanks for your help with this!
@csrwng can confirm. Also, separately and briefly, what is the rationale behind the /rootfs bind?
@danwinship please could you submit a new issue for this with the versions used and --loglevel=5 logs, and I'll take a look. I'd rather not see this drop if possible.
@jim-minter I'm seeing this behavior again in recent versions of openshift when running on linux. @danwinship @knobunc do you have any suggestions for further debugging?
@csrwng getting an strace might show what's going on - you'll probably need to run strace -f -o <output_file> -p on the dockerd.
@csrwng (start the strace before running oc cluster up)
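A sketch of that, assuming a systemd-managed host where the daemon process is dockerd (on some older packages it is docker daemon instead):
$ DOCKERD_PID=$(pidof dockerd)                          # or: systemctl show -p MainPID docker
$ strace -f -o /tmp/dockerd.strace -p "$DOCKERD_PID" &
$ oc cluster up
$ grep hairpin_mode /tmp/dockerd.strace                 # after reproducing, look for the write and its result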
or possibly even just --loglevel=5 logs. Presumably it's failing to set hairpin mode again for some reason.
Here's the server log with --loglevel=5. The strace is here (I compressed it because it was 26 MB):
node logs, not master
@danwinship it's all-in-one so it's both node and master
Some additional info ... did an unscientific test on @bparees's machine (which has docker 1.10), and accessing the service works from the pod. So this seems broken only on 1.12, which is what I'm running on my RHEL box. To summarize: @jwhonce do you know of any changes in Docker that affect pod networking?
Those logs are incomplete in some way... eg, there's nothing logged from kubelet.go (and in particular no line 'Hairpin mode set to "..."'). Maybe relevant:
let me try again
Strange, if I don't set the --loglevel, then I get the log about the hairpin mode; otherwise I don't.
@csrwng on latest F25, docker-1.12.6-5.git037a2f5.fc25.x86_64, oc from https://github.com/openshift/origin/releases/download/v1.5.0-alpha.2/openshift-origin-client-tools-v1.5.0-alpha.2-e4b43ee-linux-64bit.tar.gz, it works for me. I also get the logs @danwinship notes. Am going to need more info to reproduce here.
ok, some more info ... Right after starting the cluster, /sys/devices/virtual/net/veth*/brport/hairpin_mode is showing 0. However, I exec'd into the origin container and then wrote a 1 to /sys/devices/virtual/net/veth*/brport/hairpin_mode manually. Somehow my kubelet is skipping the step of writing a 1 to that file.
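For anyone needing that manual workaround, it looks roughly like this (a sketch, assuming the oc cluster up container is named origin; the loop is needed because a shell redirection can't target a glob that matches several files):
$ docker exec origin sh -c 'for f in /sys/devices/virtual/net/veth*/brport/hairpin_mode; do echo 1 > "$f"; done'
$ cat /sys/devices/virtual/net/docker0/brif/veth*/hairpin_mode     # should now show 1s from the host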
Found it. So the issue is that we're hard-coding docker's root directory.
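The root directory actually in use can be compared against the hard-coded default with docker info; a non-default --graph/--data-root setting is exactly the case a hard-coded path would miss:
$ docker info | grep -i 'docker root dir'
 Docker Root Dir: /var/lib/docker      # example output; differs if docker was started with a custom root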
cool, so that's an easy fix; now to figure out why it's completely broken on the Mac :)
So I found out that only Docker 1.13.0 is broken on the Mac; Docker 1.13.1-rc1 is not. I thought it could be related to them changing the default policy for FORWARD from ACCEPT to DROP in the filter table. However, the same policy is in place in 1.13.1-rc1 and that's working fine.
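For completeness, the FORWARD policy mentioned there can be checked, and reverted while testing, like this; per the comment above it turned out not to be the culprit, since 1.13.1-rc1 also ships the DROP policy and works:
$ iptables -S FORWARD | head -1        # shows "-P FORWARD DROP" on Docker >= 1.13.0
$ iptables -P FORWARD ACCEPT           # revert to the old default only to rule it out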
@csrwng - I'm using OC client tool version 3.7 and the latest docker version on RHEL. Please find the instructions that I used to install Docker and the OC client (Docker install, OC client install). After installing the clients, I performed the oc cluster up command and then added containers to the single-node cluster. The containers within the host are unable to communicate with each other; I'm getting the above issue. But when I incorporate @jim-minter's comment, i.e. adding ifconfig docker0 promisc, it works fine. So is adding this line the recommended approach, or is a permanent fix available?
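For reference, the stopgap mentioned above and its iproute2 equivalent; this approximates what the kubelet's promiscuous-bridge hairpin mode does and is not the fix that eventually landed:
$ ifconfig docker0 promisc                  # legacy form used earlier in this thread
$ ip link set docker0 promisc on            # iproute2 equivalent
$ ip link show docker0 | grep -c PROMISC    # prints 1 once the flag is set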
@msenmurugan that sounds like the original issue we had with the kubelet not being able to update the virtual network device hairpin_mode, which is why we added this bind mount to the origin container:
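One way to check whether that mount is present on a running oc cluster up installation (a sketch; the container name origin and the mount path are taken from the comment above rather than verified here):
$ docker inspect origin --format '{{ json .Mounts }}' | grep -o '/sys/devices/virtual/net'    # printed if the bind mount exists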
I haven't been able to reproduce an issue using the latest AWS RHEL AMI and the instructions above. @msenmurugan when you say "The containers within the host are unable to communicate with each other", what are you trying to do, what are you expecting to happen, and what happens? If you're still having problems, I suggest opening a new issue.
$ cat >/etc/docker/daemon.json <<EOF
{
"insecure-registries": ["172.30.0.0/16"]
}
EOF
$ service docker restart
$ oc cluster up --loglevel=2
$ oc login -u system:admin
$ oc project default
$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
docker-registry-1-wdnkw 1/1 Running 0 53s 172.17.0.5 localhost
persistent-volume-setup-gl8l5 1/1 Running 0 1m 172.17.0.2 localhost
router-1-k8tg9 1/1 Running 0 52s 172.18.14.147 localhost
$ oc get svc
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
docker-registry 172.30.1.1 <none> 5000/TCP 2m
kubernetes 172.30.0.1 <none> 443/TCP,53/UDP,53/TCP 2m
router 172.30.34.4 <none> 80/TCP,443/TCP,1936/TCP 2m
$ oc rsh router-1-k8tg9
sh-4.2$ curl -D - 172.17.0.5:5000/healthz
HTTP/1.1 200 OK
sh-4.2$ curl -D - 172.30.1.1:5000/healthz
HTTP/1.1 200 OK
Hi. We've faced the same issue in Eclipse Che on OCP (docker-based OCP). Can you give any advice on how to nail down this issue? |
@garagatyi try bringing the cluster down and restarting Docker (not restarting the machine). There seems to be an issue with iptables and Docker when Docker is first brought up by systemd; if you restart Docker after your machine is running, it fixes itself.
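Spelled out, that suggestion is (assuming a systemd host):
$ oc cluster down
$ systemctl restart docker     # restart the daemon only, not the machine
$ oc cluster up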
Thank you for the suggestion, but it didn't help.
BTW docker version is 17.09.1-ce
@msenmurugan and @garagatyi thank you both for your report; I've now been able to reproduce this. It's a new issue in an old area; I have opened https://bugzilla.redhat.com/show_bug.cgi?id=1535510 to track it.
This issue is still present with Origin v3.11 running on Docker 19.03.1, openSUSE Tumbleweed. Not sure why this is closed. The workaround resolves the issue but seems like an ugly hack to me:
Running on the latest origin vagrant VM + dnf -y update, with latest origin master, after running oc cluster up and creating a pod/service, the newly created pod can't connect outbound to its own service IP:port.
Version
Steps To Reproduce
Current Result
hangs
Expected Result
outputs the text "ok"
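The original reproduction steps are not preserved above, but from the title and the results here, the test has roughly this shape (a hypothetical sketch: the app name, image, port, and "ok" endpoint are stand-ins, not the reporter's actual steps):
$ oc cluster up
$ oc new-app --name=hello <some-http-image>                     # placeholder image serving "ok" on port 8080
$ oc expose dc/hello --port=8080                                 # create a service in front of the pod
$ SVC_IP=$(oc get svc hello -o jsonpath='{.spec.clusterIP}')
$ oc rsh dc/hello curl -s http://$SVC_IP:8080/                   # expected: "ok"; observed: hangs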
Additional Information
Two interesting additional pieces of information:
- after running ifconfig docker0 promisc on the host, things start working fully as expected