Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bump(*): update etcd to 3.2.16 and grpc to 1.7.5 #18660

Closed
wants to merge 3 commits into from

Conversation

mfojtik
Copy link
Contributor

@mfojtik mfojtik commented Feb 19, 2018

Update of etcd level from 3.2.8 to 3.2.16 and gRPC to 1.7.5 (matching etcd version).

Fixes: #18496

etcd:

grpc:

List of interesting changes or changes related to gRPC:

3.2.10: etcd-io/etcd@6d40628: update grpc, grpc-gateway (1.4.2 -> 1.7.3)
3.2.10: etcd-io/etcd@a8c84ff: clientv3: fix client balancer with gRPC v1.7
3.2.10: etcd-io/etcd@939337f: add max requests bytes, keepalive to server, blackhole methods to integration
3.2.10: etcd-io/etcd@8de0c04: Switch from boltdb v1.3.0 to coreos/bbolt v1.3.1-coreos.3 (<- concerning?)
3.2.11: etcd-io/etcd@5921b2c: log grpc stream send/recv errors in server-side
3.2.11: etcd-io/etcd@ff1f08c: upgrade grpc/grpc-go to v1.7.4
3.2.12: etcd-io/etcd@e82f055: clientv3: configure gRPC message limits in Config
3.2.12: etcd-io/etcd@c67e6d5: clientv3: call KV/Txn APIs with default gRPC call options
3.2.12: etcd-io/etcd@348b25f: clientv3: call other APIs with default gRPC call options
3.2.13: etcd-io/etcd@288ef7d: embed: fix gRPC server panic on GracefulStop
3.2.16: etcd-io/etcd@e08abbe: mvcc: restore unsynced watchers

@smarterclayton @deads2k @liggitt maybe too late in 3.9 cycle, but I don't see any huge risk change and this is a minor version bump that contain plenty of bug fixes.

This was clean bump, no build errors or panics during server start. There were 0 picks/carries on grpc or etcd.

@deads2k I wonder if we need to add grpc into our glide.yaml... If I don't and just bump the etcd, there are no changes in grpc, just etcd. I was worried that whatever higher level client we have that use grpc will use different version than etcd?

@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mfojtik

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 19, 2018
@openshift-ci-robot openshift-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Feb 19, 2018
@mfojtik mfojtik requested review from liggitt, smarterclayton and deads2k and removed request for bparees and stevekuznetsov February 19, 2018 10:33
@mfojtik mfojtik added this to the 3.9.0 milestone Feb 19, 2018
@openshift-merge-robot openshift-merge-robot added the vendor-update Touching vendor dir or related files label Feb 19, 2018
@@ -1303,7 +1310,7 @@ imports:
- name: gopkg.in/yaml.v2
version: 53feefa2559fb8dfa8d81baad31be332c97d6c77
- name: k8s.io/api
version: 78d52114c359511e7d0d5af4fc50148b5392d7b4
version: 30a449993d4d38589fa79362431bd9b04644cab5
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deads2k this is probably OK, reflecting the branches that are now synced by publishing bot.

@@ -93,7 +93,11 @@ import:
version: ded73eae5db7e7a0ef6f55aace87a2873c5d2b74
# master
- package: github.com/coreos/etcd
version: v3.2.8
version: v3.2.16
- package: google.golang.org/grpc
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is pinning the version of grpc to the version used by etcd 3.2.16

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't that be discovered by glide?

Copy link
Contributor Author

@mfojtik mfojtik Feb 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

surprisingly not... when I just bumped etcd, these were unchanged (on disk).

Copy link
Contributor Author

@mfojtik mfojtik Feb 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is strange is that the HEAD version of grpc is 1.3.0, but the etcd 3.2.8 version of grpc is 1.2.1 (from https://github.com/coreos/etcd/blob/v3.2.8/glide.yaml#L99). That means that in HEAD we are ahead of what our level of etcd was using.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when I just bumped etcd, these were unchanged (on disk).

did you unpin these first? if not, pinned means pinned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liggitt i don't see how they are currently pinned, we don't mention them in glide.yaml

Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@mfojtik
Copy link
Contributor Author

mfojtik commented Feb 19, 2018

/retest

@mfojtik
Copy link
Contributor Author

mfojtik commented Feb 19, 2018

For the unit test, there is no failure reported in log but it is reported as failed because of build failed? Looks like #18497

@mfojtik
Copy link
Contributor Author

mfojtik commented Feb 19, 2018

/test all

@mfojtik
Copy link
Contributor Author

mfojtik commented Feb 19, 2018

hrm. ok the unit tests are maybe not flaking bug failing to build but I haven't found the reason yet.,

@mfojtik
Copy link
Contributor Author

mfojtik commented Feb 19, 2018

/retest

@stevekuznetsov
Copy link
Contributor

@mfojtik if that is 1000% reproducible on test unit let's debug today

@mfojtik
Copy link
Contributor Author

mfojtik commented Feb 19, 2018

@stevekuznetsov seems like this PR didn't have a single green unit test run...

@mfojtik
Copy link
Contributor Author

mfojtik commented Feb 19, 2018

@stevekuznetsov ok, I found a panic in vendor/k8s.io/kubernetes/pkg/master :-)

@mfojtik
Copy link
Contributor Author

mfojtik commented Feb 19, 2018

interesting:

2018-02-19 15:09:33.516913 I | integration: terminating 6914312482561710168 (unix://localhost:69143124825617101680)
WARNING: 2018/02/19 15:09:33 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial unix localhost:69143124825617101680: connect: no such file or directory"; Reconnecting to {localhost:69143124825617101680 0  <nil>}
WARNING: 2018/02/19 15:09:33 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial unix localhost:69143124825617101680: connect: no such file or directory"; Reconnecting to {localhost:69143124825617101680 0  <nil>}
WARNING: 2018/02/19 15:09:33 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial unix localhost:69143124825617101680: connect: no such file or directory"; Reconnecting to {localhost:69143124825617101680 0  <nil>}
WARNING: 2018/02/19 15:09:33 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial unix localhost:69143124825617101680: connect: no such file or directory"; Reconnecting to {localhost:69143124825617101680 0  <nil>}
WARNING: 2018/02/19 15:09:33 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial unix localhost:69143124825617101680: connect: no such file or directory"; Reconnecting to {localhost:69143124825617101680 0  <nil>}
WARNING: 2018/02/19 15:09:33 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial unix localhost:69143124825617101680: connect: no such file or directory"; Reconnecting to {localhost:69143124825617101680 0  <nil>}
WARNING: 2018/02/19 15:09:33 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial unix localhost:69143124825617101680: connect: no such file or directory"; Reconnecting to {localhost:69143124825617101680 0  <nil>}
WARNING: 2018/02/19 15:09:33 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial unix localhost:69143124825617101680: connect: no such file or directory"; Reconnecting to {localhost:69143124825617101680 0  <nil>}
WARNING: 2018/02/19 15:09:33 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial unix localhost:69143124825617101680: connect: no such file or directory"; Reconnecting to {localhost:69143124825617101680 0  <nil>}
WARNING: 2018/02/19 15:09:33 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial unix localhost:69143124825617101680: connect: no such file or directory"; Reconnecting to {localhost:69143124825617101680 0  <nil>}
WARNING: 2018/02/19 15:09:33 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial unix localhost:69143124825617101680: connect: no such file or directory"; Reconnecting to {localhost:69143124825617101680 0  <nil>}
WARNING: 2018/02/19 15:09:33 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"

EDIT: According to kube these are just log spam (?!).

@liggitt
Copy link
Contributor

liggitt commented Feb 19, 2018

might need to pull in kube changes from a similar bump:

https://github.com/kubernetes/kubernetes/pull/57480/commits

@liggitt
Copy link
Contributor

liggitt commented Feb 19, 2018

does updating grpc have implications for containerd, cadvisor, or kubelet APIs?

@smarterclayton
Copy link
Contributor

/test all

@mfojtik
Copy link
Contributor Author

mfojtik commented Feb 20, 2018

might need to pull in kube changes from a similar bump:
https://github.com/kubernetes/kubernetes/pull/57480/commits

As strange as it is, we seems to have some of these changes already incorporated...

EDIT: Found mfojtik@ffb2172

It has all these changes, one of them was little wrong, but I don't think that will affect the test.

@liggitt i tested with 3.2.14 (what upstream use now) and I'm getting same unit test failure and it is reproducible locally as well... That means this is going to kick us in 1.10.x right?

EDIT: Found the issue. The test is actually timeouting because we set the unit test timeout to 120s. This test seems to now take over 200s:

[mfojtik@dev origin]$ ./hack/test-go.sh ./vendor/k8s.io/kubernetes/pkg/master
[INFO] [CLEANUP] Cleaning up temporary directories
ok  	github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/master	224.226s	coverage: 51.8% of statements
[INFO] hack/test-go.sh exited with code 0 after 00h 03m 55s

In k8s: kubernetes/kubernetes#59685 and kubernetes/kubernetes#59838 are related. @liggitt are you ok with living with increased timeout till the upstream issue is fixed? (i guess we will have to do this for 3.10 anyway)

@mfojtik
Copy link
Contributor Author

mfojtik commented Feb 20, 2018

I added bump(go-openapi/spec): 1de3e0542de65ad8d75452a595886fdd0befb363 as it claims to cut ~25s from the test run. Also increased the unit test timeout to see if this will pass.

@mfojtik
Copy link
Contributor Author

mfojtik commented Feb 20, 2018

does updating grpc have implications for containerd, cadvisor, or kubelet APIs?

@liggitt it does (from description of upstream bump):

Upgrade to containerd v1.0.0? Currently kubernetes depends on containerd v1.0.0-beta.2-159-g27d450a0 which depends on grpc v1.3.0, but containerd v1.0.0 depends on grpc 1.7.2. Not needed. The containerd grpc upgrade required no code changes.

I think we already are on [email protected] so that should be no-op.

@liggitt
Copy link
Contributor

liggitt commented Feb 20, 2018

github.com/openshift/origin/pkg/cmd/server/kubernetes/master TestNewMasterLeasesHasCorrectTTL 0.12s

go run hack/e2e.go -v -test --test_args='--ginkgo.focus=github.com/openshift/origin/pkg/cmd/server/kubernetes/master\sTestNewMasterLeasesHasCorrectTTL$'

utils.go:293: Failed to start etcd server error=failed to initialize the etcd server: auth: invalid auth options

@liggitt
Copy link
Contributor

liggitt commented Feb 20, 2018

@liggitt are you ok with living with increased timeout till the upstream issue is fixed?

no, we need to know why this change destroyed our package test time:

before this PR:

ok  	github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/master	9.646s

with this PR:

ok  	github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/master	456.762s	

@liggitt
Copy link
Contributor

liggitt commented Feb 20, 2018

fast
--- PASS: TestWriteClientCAs (0.00s)
--- PASS: TestReconcileEndpoints (0.02s)
--- PASS: TestCreateOrUpdateMasterService (0.00s)
--- PASS: TestGroupVersions (0.00s)
--- PASS: TestTypeTags (0.10s)
--- PASS: TestLegacyRestStorageStrategies (1.68s)
--- PASS: TestCertificatesRestStorageStrategies (1.78s)
--- PASS: TestVersion (2.32s)
--- PASS: TestGetNodeAddresses (0.00s)
--- PASS: TestAPIVersionOfDiscoveryEndpoints (1.52s)
--- PASS: TestNoAlphaVersionsEnabledByDefault (0.00s)

slow
--- PASS: TestWriteClientCAs (0.00s)
--- PASS: TestReconcileEndpoints (0.02s)
--- PASS: TestCreateOrUpdateMasterService (0.01s)
--- PASS: TestGroupVersions (0.00s)
--- PASS: TestTypeTags (0.23s)
--- PASS: TestLegacyRestStorageStrategies (4.51s)
--- PASS: TestCertificatesRestStorageStrategies (37.12s)
--- PASS: TestVersion (110.88s)
--- PASS: TestGetNodeAddresses (0.01s)
--- PASS: TestAPIVersionOfDiscoveryEndpoints (299.09s)
--- PASS: TestNoAlphaVersionsEnabledByDefault (0.00s)

slow2:
--- PASS: TestWriteClientCAs (0.00s)
--- PASS: TestReconcileEndpoints (0.01s)
--- PASS: TestCreateOrUpdateMasterService (0.01s)
--- PASS: TestGroupVersions (0.00s)
--- PASS: TestTypeTags (0.10s)
--- PASS: TestLegacyRestStorageStrategies (7.48s)
--- PASS: TestCertificatesRestStorageStrategies (36.00s)
Timed out on TestVersion

@mfojtik
Copy link
Contributor Author

mfojtik commented Feb 20, 2018

/retest

@mfojtik
Copy link
Contributor Author

mfojtik commented Feb 20, 2018

@liggitt TestAPIVersionOfDiscoveryEndpoints ?

@openshift-ci-robot
Copy link

openshift-ci-robot commented Feb 20, 2018

@mfojtik: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/openshift-jenkins/unit 174022f link /test unit

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@mfojtik
Copy link
Contributor Author

mfojtik commented Feb 23, 2018

Moved to #18731

@mfojtik mfojtik deleted the bump-etcd branch September 5, 2018 21:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. vendor-update Touching vendor dir or related files
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Data race in grpc lib used by etcd client
7 participants