Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enable etcd watch cache for k8s types #8395

Merged
merged 3 commits into from
Apr 12, 2016
Merged

enable etcd watch cache for k8s types #8395

merged 3 commits into from
Apr 12, 2016

Conversation

liggitt
Copy link
Contributor

@liggitt liggitt commented Apr 6, 2016

related to #8392

@liggitt
Copy link
Contributor Author

liggitt commented Apr 6, 2016

[test] [extended:core]

@liggitt
Copy link
Contributor Author

liggitt commented Apr 6, 2016

@smarterclayton @timothysc

@@ -46,6 +46,7 @@ func NewServerRunOptions() *ServerRunOptions {
InsecureBindAddress: net.ParseIP("127.0.0.1"),
InsecurePort: 8080,
LongRunningRequestRE: defaultLongRunningRequestRE,
MaxRequestsInFlight: 400,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo we can safely up this to 600, b/c we plan to override @ scale today. Smaller installations would be unaffected.

The legacy # of 400 was chosen back in the 1.0 release.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is per master

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, Kube doesn't always do their options "right" so we don't get the tap
on the shoulder in the code where the new options are obviously missing.

On Wed, Apr 6, 2016 at 6:19 PM, Jordan Liggitt [email protected]
wrote:

In
Godeps/_workspace/src/k8s.io/kubernetes/pkg/genericapiserver/server_run_options.go
#8395 (comment):

@@ -46,6 +46,7 @@ func NewServerRunOptions() *ServerRunOptions {
InsecureBindAddress: net.ParseIP("127.0.0.1"),
InsecurePort: 8080,
LongRunningRequestRE: defaultLongRunningRequestRE,

  •   MaxRequestsInFlight:  400,
    

This is per master


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
https://github.com/openshift/origin/pull/8395/files/714b88b4308eb44f74ead7e66395fd98aa98207c#r58792210

@jeremyeder
Copy link
Contributor

Question out of my ignorance for the process...when we rebase kube into origin, wouldn't these things be included?

@liggitt
Copy link
Contributor Author

liggitt commented Apr 6, 2016

We get the capability for k8s resources, but we drive server startup from our config, not from command line flags. We continually have to verify that new kubernetes options we want are enabled, and new kubernetes options we don't want are not enabled.

@timothysc
Copy link

A configdump option seems like it's going from a nice-to-have, to essential... pretty fast.

@liggitt
Copy link
Contributor Author

liggitt commented Apr 7, 2016

Not sure what you mean by "config dump option"

@timothysc
Copy link

The main idea was to dump all the values of all the knobs. kubernetes/kubernetes#14916

@liggitt
Copy link
Contributor Author

liggitt commented Apr 7, 2016

Ah. Unfortunately, at the point where the config, defaults, and flags have been transformed into the actual options structs used to run the components, about a third of the fields are no longer serializable values, but are instantiated runtime objects.

@derekwaynecarr
Copy link
Member

I am in favor of enabling this change.

@ncdc
Copy link
Contributor

ncdc commented Apr 7, 2016

This is probably something we should turn on, considering it's a big performance win in Kube 1.2. My only hesitation is how late in the release cycle we are, but I think we're better off enabling it and keeping an eye out for weird/unexpected issues. And then make supporting the watch cache for origin resources a P0 but maybe not strictly required for 3.2.

@jeremyeder
Copy link
Contributor

@derekwaynecarr thoughts on this being related to web UI scaling issues, as well?

@liggitt
Copy link
Contributor Author

liggitt commented Apr 7, 2016

It's definitely related. Multiple identical watches should not all be hitting the etcd backend with this cache turned on

@ncdc
Copy link
Contributor

ncdc commented Apr 7, 2016

Even more reason to enable it, and add support for the Origin resources too

@jeremyeder
Copy link
Contributor

@abhgupta this one should be on your list.

@liggitt
Copy link
Contributor Author

liggitt commented Apr 7, 2016

Totally agree, just a question of risk. I'm 65% in favor of enabling the but we have now (k8s resource watch cache)

@ncdc
Copy link
Contributor

ncdc commented Apr 7, 2016

Let's enumerate the possible known risks?

@timothysc
Copy link

I'm testing by explicitly setting right now.

@jeremyeder
Copy link
Contributor

I've also asked @rflorenc to re-run his webUI tests with

kubernetesMasterConfig: 
  apiServerArguments:
    watch-cache:
      - "true"

@liggitt
Copy link
Contributor Author

liggitt commented Apr 7, 2016

You need this PR to make the setting effective

@smarterclayton
Copy link
Contributor

Why wouldn't that be enabled? It's on the APIServer struct that we would pass down?

@smarterclayton
Copy link
Contributor

.... we don't use the APIServer object we create anywhere later?

@liggitt
Copy link
Contributor Author

liggitt commented Apr 7, 2016

@liggitt
Copy link
Contributor Author

liggitt commented Apr 7, 2016

The main risk is that the watch cache behaves differently than the watch directly against etcd. Looks like one of the test runs failed with this:

FAILURE after 1.311s: test/cmd/admin.sh:364: executing 'oc get user/~ --token="$( oc sa new-token my-sa-name )"' expecting success and text 'system:serviceaccount:.+:my-sa-name': the output content test failed
Standard output from the command:
NAME           UID       FULL NAME   IDENTITIES
system:admin                         
Standard error from the command:
error: unxepected action: token was added after initial creation

which means we got an Added event on a watch started from the resource version of a newly created object. That sounds like different behavior to me.

@smarterclayton
Copy link
Contributor

Just seeing this makes me say this is too risky for 1.2. We'll have to get
a much longer bake time.

On Thu, Apr 7, 2016 at 1:49 PM, Jordan Liggitt [email protected]
wrote:

The main risk is that the watch cache behaves differently than the watch
directly against etcd. Looks like one of the test runs failed with this:

FAILURE after 1.311s: test/cmd/admin.sh:364: executing 'oc get user/~ --token="$( oc sa new-token my-sa-name )"' expecting success and text 'system:serviceaccount:.+:my-sa-name': the output content test failed
Standard output from the command:
NAME UID FULL NAME IDENTITIES
system:admin
Standard error from the command:
error: unxepected action: token was added after initial creation

which means we got an Added event on a watch started from the resource
version of a newly created object. That sounds like different behavior to
me.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#8395 (comment)

@timothysc
Copy link

Could we default off, and vet for openshift?

@liggitt
Copy link
Contributor Author

liggitt commented Apr 7, 2016

sure

@timothysc
Copy link

What is weird to me is storage is configured per-resource, and each resource sets the boundary for it's cache.

@timothysc
Copy link

So in digging into the reason for the error it looks like openshift is triggering on implied watch event semantics, where upstream does not.

Upstream tests on the cacher clearly show semantics as
watch.Added then watch.Modified

Where the code in https://github.com/openshift/origin/blob/master/pkg/cmd/cli/sa/newtoken.go#L219 could probably just read as:

case watch.Added, watch.Modified:

@liggitt are there other failures besides this one, that I'm not seeing?

@liggitt
Copy link
Contributor Author

liggitt commented Apr 7, 2016

hadn't dug yet, and that failure blocked later tests from running

@liggitt
Copy link
Contributor Author

liggitt commented Apr 7, 2016

without the cache, watching from resourceVersion N, the first delivered event is the first resourceVersion past N

watching from resource version 325
Got MODIFIED event for resource version 326

with the cache, watching from resourceVerson N, the first delivered event is resourceVersion N.

watching from resource version 311
Got ADDED event for resource version 311

that means the cache is not a transparent change... we'll need to fix that before we can enable it

and that behavior just shipped in 1.2 upstream :(

@smarterclayton
Copy link
Contributor

smarterclayton commented Apr 7, 2016 via email

@liggitt
Copy link
Contributor Author

liggitt commented Apr 7, 2016

opened kubernetes/kubernetes#24004

@liggitt
Copy link
Contributor Author

liggitt commented Apr 8, 2016

updated with upstream watch cache fix, rerunning tests

@jeremyeder
Copy link
Contributor

@liggitt so ... with that should we re-test at this point?

@liggitt
Copy link
Contributor Author

liggitt commented Apr 8, 2016

won't really affect performance numbers, it'll just let our tests and controllers that really care about exact resourceVersion starting points work correctly

@jeremyeder
Copy link
Contributor

ah. ok. thank you.

@smarterclayton
Copy link
Contributor

[test]

On Fri, Apr 8, 2016 at 10:55 AM, OpenShift Bot [email protected]
wrote:

continuous-integration/openshift-jenkins/test NOTFOUND (
https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/2827/)
(Extended Tests: core)


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#8395 (comment)

@liggitt liggitt changed the title WIP - enable etcd watch cache for k8s types enable etcd watch cache for k8s types Apr 8, 2016
@openshift-bot
Copy link
Contributor

Evaluated for origin test up to 3639830

@openshift-bot
Copy link
Contributor

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/2850/) (Extended Tests: core)

@liggitt
Copy link
Contributor Author

liggitt commented Apr 8, 2016

clean origin test runs. 2 failures in extended tests:

Summarizing 2 Failures:

[Fail] deployments: parallel: test deployment test deployment [It] should run a deployment to completion and then scale to zero
/data/src/github.com/openshift/origin/test/extended/deployments/deployments.go:29

[Fail] Kubectl client Update Demo [It] should scale a replication controller [Conformance]
/data/src/github.com/openshift/origin/Godeps/_workspace/src/k8s.io/kubernetes/test/e2e/util.go:1469

@smarterclayton
Copy link
Contributor

Those are known flakes

On Apr 8, 2016, at 3:21 PM, Jordan Liggitt [email protected] wrote:

clean origin test runs. 2 failures in extended tests:

Summarizing 2 Failures:

[Fail] deployments: parallel: test deployment test deployment [It] should
run a deployment to completion and then scale to zero
/data/src/
github.com/openshift/origin/test/extended/deployments/deployments.go:29

[Fail] Kubectl client Update Demo [It] should scale a replication
controller [Conformance]
/data/src/
github.com/openshift/origin/Godeps/_workspace/src/k8s.io/kubernetes/test/e2e/util.go:1469


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#8395 (comment)

@timothysc
Copy link

Are we in the clear then?

We're fixing the dep issues on the e2es upstream, btw.

@smarterclayton
Copy link
Contributor

smarterclayton commented Apr 11, 2016 via email

@smarterclayton
Copy link
Contributor

smarterclayton commented Apr 11, 2016 via email

@openshift-bot
Copy link
Contributor

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/5566/) (Image: devenv-rhel7_3952)

@smarterclayton
Copy link
Contributor

Wonderful:

--- FAIL: TestTryOrdering (0.02s)
    rate_limited_queue_test.go:179: order was wrong: [first third second]
FAIL
coverage: 61.4% of statements

@smarterclayton
Copy link
Contributor

Spawned kubernetes/kubernetes#24125

[merge]

@liggitt
Copy link
Contributor Author

liggitt commented Apr 12, 2016

Not new to this PR

@openshift-bot
Copy link
Contributor

Evaluated for origin merge up to 3639830

@openshift-bot openshift-bot merged commit 0910128 into openshift:master Apr 12, 2016
@liggitt liggitt deleted the etcd-cache branch April 14, 2016 14:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants