enable etcd watch cache for k8s types #8395

liggitt · 2016-04-06T20:58:00Z

related to #8392

liggitt · 2016-04-06T21:10:24Z

[test] [extended:core]

liggitt · 2016-04-06T21:10:43Z

timothysc · 2016-04-06T21:18:08Z

Godeps/_workspace/src/k8s.io/kubernetes/pkg/genericapiserver/server_run_options.go

@@ -46,6 +46,7 @@ func NewServerRunOptions() *ServerRunOptions {
 		InsecureBindAddress:  net.ParseIP("127.0.0.1"),
 		InsecurePort:         8080,
 		LongRunningRequestRE: defaultLongRunningRequestRE,
+		MaxRequestsInFlight:  400,


imo we can safely up this to 600, b/c we plan to override @ scale today. Smaller installations would be unaffected.

The legacy # of 400 was chosen back in the 1.0 release.

This is per master

Also, Kube doesn't always do their options "right", so we don't get the tap on the shoulder in the code where the new options are obviously missing.

jeremyeder · 2016-04-06T22:06:36Z

Question out of my ignorance for the process...when we rebase kube into origin, wouldn't these things be included?

liggitt · 2016-04-06T22:19:23Z

We get the capability for k8s resources, but we drive server startup from our config, not from command line flags. We continually have to verify that new kubernetes options we want are enabled, and new kubernetes options we don't want are not enabled.

timothysc · 2016-04-07T02:36:54Z

A configdump option seems like it's going from a nice-to-have, to essential... pretty fast.

liggitt · 2016-04-07T02:37:48Z

Not sure what you mean by "config dump option"

timothysc · 2016-04-07T02:42:35Z

The main idea was to dump all the values of all the knobs. kubernetes/kubernetes#14916

liggitt · 2016-04-07T02:53:36Z

Ah. Unfortunately, at the point where the config, defaults, and flags have been transformed into the actual options structs used to run the components, about a third of the fields are no longer serializable values, but are instantiated runtime objects.
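
To illustrate the serialization problem, here is a minimal Go sketch. The `fileConfig` and `runOptions` types and the `dump` helper are hypothetical and are not taken from Origin or Kubernetes; the only real behavior relied on is that `encoding/json` refuses to encode function values.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fileConfig is a hypothetical config as read from disk: plain values only.
type fileConfig struct {
	Port int `json:"port"`
}

// runOptions is a hypothetical "resolved" options struct. After defaulting,
// some fields are live runtime objects rather than plain values.
type runOptions struct {
	Port    int
	Handler func() // a func value, which encoding/json cannot encode
}

// dump tries to serialize any value to JSON and returns the error, if any.
func dump(v interface{}) (string, error) {
	b, err := json.Marshal(v)
	return string(b), err
}

func main() {
	s, err := dump(fileConfig{Port: 8080})
	fmt.Println(s, err)

	// Marshalling fails because of the Handler field.
	_, err = dump(runOptions{Port: 8080, Handler: func() {}})
	fmt.Println(err)
}
```

A dump taken before the config is turned into runtime objects would not hit this limitation, but it would also not show the final effective values.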

derekwaynecarr · 2016-04-07T14:54:29Z

I am in favor of enabling this change.

ncdc · 2016-04-07T15:03:19Z

This is probably something we should turn on, considering it's a big performance win in Kube 1.2. My only hesitation is how late in the release cycle we are, but I think we're better off enabling it and keeping an eye out for weird/unexpected issues. And then make supporting the watch cache for origin resources a P0 but maybe not strictly required for 3.2.

jeremyeder · 2016-04-07T15:26:09Z

@derekwaynecarr thoughts on this being related to web UI scaling issues, as well?

liggitt · 2016-04-07T15:27:42Z

It's definitely related. Multiple identical watches should not all be hitting the etcd backend with this cache turned on
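
As a rough illustration of why identical watches stop reaching the backend, here is a toy fan-out sketch in Go. The `backend` and `watchCache` types are hypothetical stand-ins and do not reflect how the Kubernetes cacher is actually implemented; the sketch only shows one backend watch being shared by several subscribers.

```go
package main

import "fmt"

// backend is a hypothetical stand-in for etcd that counts opened watches.
type backend struct{ watches int }

func (b *backend) openWatch() { b.watches++ }

// watchCache opens a single backend watch and fans events out to subscribers.
type watchCache struct {
	backend     *backend
	started     bool
	subscribers []chan string
}

// Watch registers a subscriber; only the first call touches the backend.
func (c *watchCache) Watch() <-chan string {
	if !c.started {
		c.backend.openWatch()
		c.started = true
	}
	// Buffered so that dispatch does not block in this small sketch.
	ch := make(chan string, 16)
	c.subscribers = append(c.subscribers, ch)
	return ch
}

// dispatch delivers one backend event to every registered subscriber.
func (c *watchCache) dispatch(ev string) {
	for _, ch := range c.subscribers {
		ch <- ev
	}
}

func main() {
	b := &backend{}
	c := &watchCache{backend: b}
	w1, w2, w3 := c.Watch(), c.Watch(), c.Watch()
	c.dispatch("MODIFIED pod/a")
	fmt.Println(<-w1, <-w2, <-w3)
	fmt.Println("backend watches:", b.watches) // backend watches: 1
}
```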

ncdc · 2016-04-07T15:28:09Z

Even more reason to enable it, and add support for the Origin resources too

jeremyeder · 2016-04-07T15:29:11Z

@abhgupta this one should be on your list.

liggitt · 2016-04-07T15:29:52Z

Totally agree, just a question of risk. I'm 65% in favor of enabling the bit we have now (k8s resource watch cache)

ncdc · 2016-04-07T15:30:13Z

Let's enumerate the possible known risks?

timothysc · 2016-04-07T15:55:19Z

I'm testing by explicitly setting it right now.

jeremyeder · 2016-04-07T17:18:50Z

I've also asked @rflorenc to re-run his webUI tests with

kubernetesMasterConfig: 
  apiServerArguments:
    watch-cache:
      - "true"

liggitt · 2016-04-07T17:20:59Z

You need this PR to make the setting effective

smarterclayton · 2016-04-07T17:26:05Z

Why wouldn't that be enabled? It's on the APIServer struct that we would pass down?

smarterclayton · 2016-04-07T17:28:58Z

.... we don't use the APIServer object we create anywhere later?

liggitt · 2016-04-07T17:37:50Z

only to populate the genericapiserver config struct at https://github.com/openshift/origin/pull/8395/files#diff-05523003a782d7b3b61c2608a29dfb39R255

liggitt · 2016-04-07T17:48:59Z

The main risk is that the watch cache behaves differently than the watch directly against etcd. Looks like one of the test runs failed with this:

FAILURE after 1.311s: test/cmd/admin.sh:364: executing 'oc get user/~ --token="$( oc sa new-token my-sa-name )"' expecting success and text 'system:serviceaccount:.+:my-sa-name': the output content test failed
Standard output from the command:
NAME           UID       FULL NAME   IDENTITIES
system:admin                         
Standard error from the command:
error: unxepected action: token was added after initial creation

which means we got an Added event on a watch started from the resource version of a newly created object. That sounds like different behavior to me.

smarterclayton · 2016-04-07T17:50:35Z

Just seeing this makes me say this is too risky for 1.2. We'll have to get a much longer bake time.

timothysc · 2016-04-07T17:52:56Z

Could we default off, and vet for openshift?

liggitt · 2016-04-07T17:54:02Z

sure

timothysc · 2016-04-07T18:26:25Z

What is weird to me is storage is configured per-resource, and each resource sets the boundary for its cache.

timothysc · 2016-04-07T21:46:09Z

So in digging into the reason for the error it looks like openshift is triggering on implied watch event semantics, where upstream does not.

Upstream tests on the cacher clearly show semantics as
watch.Added then watch.Modified

Where the code in https://github.com/openshift/origin/blob/master/pkg/cmd/cli/sa/newtoken.go#L219 could probably just read as:

case watch.Added, watch.Modified:

@liggitt are there other failures besides this one, that I'm not seeing?
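
A minimal sketch of the suggested `case watch.Added, watch.Modified:` handling, written in Go. The `EventType` constants, the `Event` struct, and `tokenReady` are local, hypothetical stand-ins that only mirror the names in the Kubernetes watch package; this is not the code in `newtoken.go`.

```go
package main

import "fmt"

// EventType and the constants below are local stand-ins that mirror the names
// used by the Kubernetes watch package; they are not the real types.
type EventType string

const (
	Added    EventType = "ADDED"
	Modified EventType = "MODIFIED"
	Deleted  EventType = "DELETED"
)

// Event is a simplified watch event carrying only what this sketch needs.
type Event struct {
	Type     EventType
	HasToken bool // hypothetical: whether the watched object has its token populated
}

// tokenReady reports whether the event shows a populated token.
// Added and Modified are handled identically, so the caller does not depend
// on which of the two the server happens to deliver first.
func tokenReady(ev Event) (bool, error) {
	switch ev.Type {
	case Added, Modified:
		return ev.HasToken, nil
	case Deleted:
		return false, fmt.Errorf("object was deleted while waiting for token")
	default:
		return false, fmt.Errorf("unexpected event type %q", ev.Type)
	}
}

func main() {
	events := []Event{{Added, true}, {Modified, false}, {Deleted, false}}
	for _, ev := range events {
		ok, err := tokenReady(ev)
		fmt.Println(ev.Type, ok, err)
	}
}
```

Treating both event types the same makes the client tolerant of either delivery order, at the cost of no longer detecting a genuinely unexpected Added event.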

liggitt · 2016-04-07T21:48:13Z

hadn't dug yet, and that failure blocked later tests from running

liggitt · 2016-04-07T23:12:02Z

without the cache, watching from resourceVersion N, the first delivered event is the first resourceVersion past N

watching from resource version 325
Got MODIFIED event for resource version 326

with the cache, watching from resourceVersion N, the first delivered event is resourceVersion N.

watching from resource version 311
Got ADDED event for resource version 311

that means the cache is not a transparent change... we'll need to fix that before we can enable it

and that behavior just shipped in 1.2 upstream :(
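
The two behaviors described above can be modeled with a toy event history in Go. The `event` type, the function names, and the sample history are illustrative assumptions, not the etcd or cacher code paths.

```go
package main

import "fmt"

// event is a hypothetical, simplified record of a change at a resource version.
type event struct {
	Type            string
	ResourceVersion uint64
}

// eventsAfter models the uncached behavior: only events with a resource
// version strictly greater than the starting version are delivered.
func eventsAfter(history []event, from uint64) []event {
	var out []event
	for _, e := range history {
		if e.ResourceVersion > from {
			out = append(out, e)
		}
	}
	return out
}

// eventsFrom models the cached behavior observed in this thread: the event at
// the starting version itself is delivered as well.
func eventsFrom(history []event, from uint64) []event {
	var out []event
	for _, e := range history {
		if e.ResourceVersion >= from {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	history := []event{{"ADDED", 311}, {"MODIFIED", 312}}
	fmt.Println(eventsAfter(history, 311)) // [{MODIFIED 312}]
	fmt.Println(eventsFrom(history, 311))  // [{ADDED 311} {MODIFIED 312}]
}
```

A client that creates an object and then watches from the returned resource version sees its own creation replayed only in the second model, which matches the failing test.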

smarterclayton · 2016-04-07T23:22:35Z

We definitely need to fix upstream - that's a breaking, non backwards compatible change, and it's horrifying no one noticed.

liggitt · 2016-04-07T23:47:41Z

opened kubernetes/kubernetes#24004

liggitt · 2016-04-08T13:29:31Z

updated with upstream watch cache fix, rerunning tests

jeremyeder · 2016-04-08T13:38:43Z

@liggitt so ... with that should we re-test at this point?

liggitt · 2016-04-08T13:43:59Z

won't really affect performance numbers, it'll just let our tests and controllers that really care about exact resourceVersion starting points work correctly

jeremyeder · 2016-04-08T13:44:30Z

ah. ok. thank you.

smarterclayton · 2016-04-08T15:22:17Z

[test]

On Fri, Apr 8, 2016 at 10:55 AM, OpenShift Bot [email protected]
wrote:

continuous-integration/openshift-jenkins/test NOTFOUND (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/2827/) (Extended Tests: core)

openshift-bot · 2016-04-08T17:12:00Z

Evaluated for origin test up to 3639830

openshift-bot · 2016-04-08T18:50:32Z

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/2850/) (Extended Tests: core)

liggitt · 2016-04-08T19:21:36Z

clean origin test runs. 2 failures in extended tests:

Summarizing 2 Failures:

[Fail] deployments: parallel: test deployment test deployment [It] should run a deployment to completion and then scale to zero
/data/src/github.com/openshift/origin/test/extended/deployments/deployments.go:29

[Fail] Kubectl client Update Demo [It] should scale a replication controller [Conformance]
/data/src/github.com/openshift/origin/Godeps/_workspace/src/k8s.io/kubernetes/test/e2e/util.go:1469

smarterclayton · 2016-04-08T19:37:27Z

Those are known flakes

timothysc · 2016-04-08T19:41:20Z

Are we in the clear then?

We're fixing the dep issues on the e2es upstream, btw.

smarterclayton · 2016-04-11T20:02:17Z

Approved Lgtm [merge]

smarterclayton · 2016-04-11T21:36:34Z

[merge] now that Sam's fix is in the queue

openshift-bot · 2016-04-11T23:15:18Z

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/5566/) (Image: devenv-rhel7_3952)

smarterclayton · 2016-04-12T00:01:16Z

Wonderful:

--- FAIL: TestTryOrdering (0.02s)
    rate_limited_queue_test.go:179: order was wrong: [first third second]
FAIL
coverage: 61.4% of statements

smarterclayton · 2016-04-12T00:04:27Z

Spawned kubernetes/kubernetes#24125

[merge]

liggitt · 2016-04-12T00:04:36Z

Not new to this PR

openshift-bot · 2016-04-12T00:05:17Z

Evaluated for origin merge up to 3639830

liggitt mentioned this pull request Apr 6, 2016

kubernetes defaults not held downstream. #8392

Closed

timothysc reviewed Apr 6, 2016
liggitt changed the title ~~WIP - enable etcd watch cache for k8s types~~ enable etcd watch cache for k8s types Apr 8, 2016

liggitt added 3 commits April 8, 2016 11:49

UPSTREAM: 24008: Make watch cache behave like uncached watch

1f9b8e5

UPSTREAM: 24048: Use correct defaults when binding apiserver flags

c3d1d37

Enable etcd cache for k8s resources

3639830

liggitt mentioned this pull request Apr 11, 2016

enable etcd watch cache for origin types #8469

Closed

openshift-bot merged commit 0910128 into openshift:master Apr 12, 2016

liggitt deleted the etcd-cache branch April 14, 2016 14:06
