Sequential new-app commands cause possibly exponential and latent CPU consumption #5737
Sorry, I don't have a better place to put this right now, but I use this script to start profiling: https://drive.google.com/a/redhat.com/file/d/0BxiBh6KZZAOWQ0NMYXNuRnZjbWc/view?usp=sharing I then use the script in the issue description to create 300 projects. The profiler script captures a new sample every 30 seconds.
It was going too slow by the time it got to project 260, so I stopped it, and the last several samples have nothing talking to the master. I'm actually still collecting, and the CPU is still very high but very slowly scaling down. I'll continue to collect for a while.
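The linked profiling script isn't reproduced here; purely as an illustrative sketch (the master URL, token handling, and output directory are all assumptions about the local setup), a collection loop along those lines might be:

```bash
#!/bin/bash
# Collect CPU and heap profiles from the master's pprof endpoints in a loop.
# MASTER_URL, the token handling, and OUT_DIR are illustrative; adjust for
# the local setup, and the pprof endpoints must be reachable with this token.
MASTER_URL="https://localhost:8443"
TOKEN="$(oc whoami -t)"
OUT_DIR="./profiles"
mkdir -p "$OUT_DIR"

while true; do
  ts="$(date +%s)"
  # A 30-second CPU profile; the request blocks until sampling completes,
  # so the loop naturally produces one sample roughly every 30 seconds.
  curl -sk -H "Authorization: Bearer $TOKEN" \
    "$MASTER_URL/debug/pprof/profile?seconds=30" \
    -o "$OUT_DIR/cpu-$ts.pb.gz"
  # Heap snapshot taken right after each CPU sample.
  curl -sk -H "Authorization: Bearer $TOKEN" \
    "$MASTER_URL/debug/pprof/heap" \
    -o "$OUT_DIR/heap-$ts.pb.gz"
done
```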
Detailed analysis, including tarballs containing tons of useful stats: https://docs.google.com/a/redhat.com/spreadsheets/d/1pjTbc6HQaRdybzScXy2NGwg3vIO2qHR7VOChl6-vuYY/edit?usp=sharing
A few things we discovered:
The oc discovery combined with #5760 is enough to satisfy my main concerns for this issue, and I'll open follow-ups once I've assessed the impact of item 3.
cc @smarterclayton @liggitt, let's call #5760 a fix for this.
@ironcladlou I liked the chart of requests. Any chance you ended up getting that fully automated?
The slowdown is unexpectedly large at the scales here: #5737 (comment). It's a little better for me on my machine, but I was able to get over 10 seconds at about 300-400 projects. We should try to pin this down. It might just be a matter of generating conversions and codecs.
I'm seeing a ton of garbage created - will look at it with your changes.
I want to leave this open even after we merge #5760, but will reduce the priority.
For 1.1.1 I'd like to verify we don't have O(N^2) behavior on new project creation above 1k projects.
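A cheap way to sanity-check that, once per-project creation times are being logged, is to compare the early and late parts of a run; a sketch assuming a times.csv log of "index,seconds" lines (the file name and format are assumptions, not something from this issue):

```bash
#!/bin/bash
# Compare mean creation time for the first and last 10% of projects in a
# times.csv log of "index,seconds" lines. A ratio that keeps growing as the
# total project count grows points at super-linear (e.g. N^2) behavior.
awk -F, '
  { t[NR] = $2 }
  END {
    n = NR; k = int(n / 10); if (k < 1) k = 1
    for (i = 1; i <= k; i++)         head += t[i]
    for (i = n - k + 1; i <= n; i++) tail += t[i]
    printf "first 10%%: %.2fs avg  last 10%%: %.2fs avg  ratio: %.1fx\n",
           head / k, tail / k, (tail / k) / (head / k)
  }
' times.csv
```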
[pprof attachments: alloc_space after creating 200 projects; alloc_objects]
CPU is saturated with GC.
[pprof attachments: alloc_objects at 500 projects; alloc_objects at 700 projects; tree FieldByNameFunc; tree meta.Accessor]
I'm going to say this is not fixed (better, but not fixed) due to significant GC issues. We'll deal with it in 1.1.1 with algorithmic and other memory fixes.
Regarding the profiles above: I'm pretty sure I can tweak the authorizer to avoid calling into that path.
That seems like a likely optimization target to me as well. Were you thinking of a check to see if expansion was even necessary, or a single expansion when populating the authorizer cache (or both)?
Whether it's necessary. That's easy to do without significant refactoring. The list of permissions is also relatively small, so it may be worthwhile to trade O(1) for O(n) when n < 50 compared to the GC cost.
What do you know, already a set. I'll put it together this morning.
FWIW, despite the fix @deads2k made for the request count inflation on the auth path, we still have no explanation yet (that I've seen) for the massively disproportionate amount of request handling/auth handling in both idle states and after project creation.
Take a look at the cachegrind data during the creation of just 100 projects (this includes #5760 and uses curl for the project creation): https://drive.google.com/a/redhat.com/file/d/0BxiBh6KZZAOWN212UFRLWmlYN28/view
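For reference, a hedged sketch of what curl-based project creation looks like; the oapi/v1 projectrequests path and payload shape are my recollection of the Origin v1.x API (and the host, port, and project name are illustrative), so verify against the running master before relying on it:

```bash
# Create a project straight against the API rather than through oc.
# Endpoint path and payload reflect my understanding of Origin v1.x (oapi/v1);
# host, port, token handling, and the project name are illustrative.
TOKEN="$(oc whoami -t)"
curl -sk -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"kind":"ProjectRequest","apiVersion":"v1","metadata":{"name":"perf-test-1"}}' \
  https://localhost:8443/oapi/v1/projectrequests
```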
Can you add the pprof PNGs to your script?
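Folding that in is mostly a matter of running go tool pprof over each saved profile; a sketch, assuming the profiles were saved as in the earlier collection sketch and that graphviz is installed for PNG rendering (both assumptions):

```bash
#!/bin/bash
# Render call-graph PNGs for every captured profile with go tool pprof.
# Assumes profiles were saved by the collection sketch above and that the
# openshift binary the master runs is on the PATH.
OUT_DIR="./profiles"
BINARY="$(command -v openshift)"

for prof in "$OUT_DIR"/cpu-*.pb.gz "$OUT_DIR"/heap-*.pb.gz; do
  [ -e "$prof" ] || continue
  go tool pprof -png "$BINARY" "$prof" > "${prof%.pb.gz}.png"
done
```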
I think this was resolved due to multiple changes (build controller O(N^2), authorization cache flush). Closing due to age. |
I started an etcd instance, openshift master, and openshift node on Fedora 21. I have a bash script which makes 500 new projects:
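The script itself isn't reproduced in this copy of the issue; as an illustrative sketch only (the count, project naming scheme, and timing log are assumptions, and oc new-project is the same operation as openshift cli new-project), it could look something like this:

```bash
#!/bin/bash
# Create COUNT projects one after another and log how long each creation
# takes, so the per-project slowdown can be graphed afterwards.
# COUNT, the project names, and the log file name are illustrative.
COUNT=500
LOG="times.csv"
: > "$LOG"

for i in $(seq 1 "$COUNT"); do
  start="$(date +%s.%N)"
  oc new-project "perf-test-$i" > /dev/null
  end="$(date +%s.%N)"
  echo "$i,$(echo "$end - $start" | bc)" >> "$LOG"
done
```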
The result is that CPU usage climbs and the master starts grinding down. Even long after I cut the script off at ~160 projects, master CPU consumption is still at like 60% but seems to be falling very, very gradually. I haven't yet graphed CPU consumption.
I'll attach the openshift cli new-project times and some pprof traces captured during the test and afterward, during the weird "idle" period.