bump(*): update etcd to 3.2.16 and grpc to 1.7.5 #18731
Conversation
Upstream kubernetes/kubernetes#60299
Force-pushed from f350412 to 736404c
@deads2k: @juanvallejo this is highlighting shortcomings in the dep set detection. bolt really shouldn't be considered "ours". We'll never pin it. We have to think of a different rule. @sttts thanks for the list
still timing out:
Which Go version do we use? 1.9? I saw a panic in Go 1.10 deep inside etcd that caused the test to time out.
I also dropped @mfojtik's wip timeout commit, so that might have an influence here, independently of my commit.
yes, go 1.9
sure, though before this PR that package took ~9 seconds, and with this PR it takes ~450s (timing out at 120s by default)
Ouch, has anybody looked into the reason? Now I understand @mfojtik's 600 in that commit.
got as far as isolating which tests jumped in time, hadn't dug beyond that - #18660 (comment)
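For context, a minimal sketch of what a per-package budget along the lines of that 600 could look like; the package name, the placement in a `TestMain`, and the exact 600s value are assumptions here, not the contents of the dropped wip commit:

```go
package integration

import (
	"os"
	"testing"
	"time"
)

// TestMain is a hypothetical guard: if the etcd-backed tests in this package
// exceed a 600s budget, panic with a clear message instead of silently
// hanging until the harness timeout kills the run.
func TestMain(m *testing.M) {
	timer := time.AfterFunc(600*time.Second, func() {
		panic("integration tests exceeded the 600s budget")
	})
	code := m.Run()
	timer.Stop()
	os.Exit(code)
}
```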
I cannot reproduce the long runtimes locally. What I found in …
/retest
Running a test locally in an infinite loop. Maybe this early termination is just a flake.
does not help.
Force-pushed from bb11a99 to 420502c
@sttts I swear I saw a comment in kubernetes claiming that all these etcd messages are "log spam".
Latest finding: the termination of the cluster is the normal … Continuing digging.
This blocks during termination:
/retest
@sttts the …
@mfojtik something is definitely blocked on shutdown, leading to those messages. But the messages themselves are not the issue.
update: I added …
That means the cause is not Terminate, and this needs more debugging :-(
@sttts +1 the messages seem to be just log spam, but what is interesting is that I would expect these messages to show the status of the connection as "shutdown" and not "connecting"... IOW, when we terminate the etcd server, we also terminate all connections, and the etcd server should terminate all connections to grpc... In code that means the connection status is updated to "shutdown", and we should not see those log messages... The fact that we see them might indicate that something is not right in the termination code. However, even if we make that termination non-blocking in the defer, the time improves by barely ~10s, which also indicates that the problem might be somewhere else.
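As a rough illustration of that expectation (not code from this PR), grpc-go exposes the client connection state, so a teardown helper could assert that it reaches Shutdown after Close; the helper name and its place in the test teardown are assumptions:

```go
package integration

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// closeAndCheck is a hypothetical teardown helper: close the client
// connection to the test etcd server and report if the connection does not
// reach Shutdown, which would match the suspicious "connecting" log spam
// described above.
func closeAndCheck(conn *grpc.ClientConn) {
	if err := conn.Close(); err != nil {
		log.Printf("closing etcd client connection: %v", err)
	}
	if st := conn.GetState(); st != connectivity.Shutdown {
		log.Printf("unexpected connection state after Close: %v", st)
	}
}
```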
It appears to be the …
Try pulling in kubernetes/kubernetes#60430
@deads2k tested, verified this makes the test run in 87s vs. 167s before.
/approve no-issue
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mfojtik, sttts The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing …
Force-pushed from dc67aaa to 373bdee
all tests are green. /lgtm
@smarterclayton FYI
/test all [submit-queue is verifying that this PR is safe to merge]
/retest
/retest
that which is not closed can eternal flake
/retest
@sttts: The following tests failed, say …
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Replaces #18660 by fixing an upstream bug with the testing etcd.
Copied from #18660:
Update of the etcd level from 3.2.8 to 3.2.16 and of gRPC to 1.7.5 (matching the etcd version).
Fixes: #18496
etcd:
grpc:
List of interesting changes or changes related to gRPC:
3.2.10: etcd-io/etcd@6d40628: update grpc, grpc-gateway (1.4.2 -> 1.7.3)
3.2.10: etcd-io/etcd@a8c84ff: clientv3: fix client balancer with gRPC v1.7
3.2.10: etcd-io/etcd@939337f: add max requests bytes, keepalive to server, blackhole methods to integration
3.2.10: etcd-io/etcd@8de0c04: Switch from boltdb v1.3.0 to coreos/bbolt v1.3.1-coreos.3 (<- concerning?)
3.2.11: etcd-io/etcd@5921b2c: log grpc stream send/recv errors in server-side
3.2.11: etcd-io/etcd@ff1f08c: upgrade grpc/grpc-go to v1.7.4
3.2.12: etcd-io/etcd@e82f055: clientv3: configure gRPC message limits in Config
3.2.12: etcd-io/etcd@c67e6d5: clientv3: call KV/Txn APIs with default gRPC call options
3.2.12: etcd-io/etcd@348b25f: clientv3: call other APIs with default gRPC call options
3.2.13: etcd-io/etcd@288ef7d: embed: fix gRPC server panic on GracefulStop
3.2.16: etcd-io/etcd@e08abbe: mvcc: restore unsynced watchers
@smarterclayton @deads2k @liggitt maybe too late in the 3.9 cycle, but I don't see any huge-risk change, and this is a minor version bump that contains plenty of bug fixes.
This was a clean bump, no build errors or panics during server start. There were 0 picks/carries on grpc or etcd.
@deads2k I wonder if we need to add grpc to our glide.yaml... If I don't, and just bump etcd, there are no changes in grpc, just etcd. I was worried that whatever higher-level client we have that uses grpc would use a different version than etcd?
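A minimal sketch of what pinning both could look like in glide.yaml, assuming the standard upstream import paths; the surrounding entries of the real file are omitted:

```yaml
# Hypothetical glide.yaml excerpt: pin gRPC alongside etcd so any higher-level
# client that uses gRPC resolves the same version etcd v3.2.16 was built against.
import:
- package: github.com/coreos/etcd
  version: v3.2.16
- package: google.golang.org/grpc
  version: v1.7.5
```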