(oidc): add considerations for impacted kube-apiserver admission plugins #1726
Conversation
Skipping CI for Draft Pull Request.
> This will be done through updates to the appropriate config observers to update the `KubeAPIServerConfig.apiServerArguments` map to:
>
> - Remove the `authorization.openshift.io/RestrictSubjectBindings` and `authorization.openshift.io/ValidateRoleBindingRestriction` admission plugins from the `--enable-admission-plugins` argument
> - Add the `authorization.openshift.io/RestrictSubjectBindings` and `authorization.openshift.io/ValidateRoleBindingRestriction` admission plugins to the `--disable-admission-plugins` argument
For clarity: AFAIU it would normally be enough to remove the plugins from the `--enable-admission-plugins` arg, as they are not default plugins that need explicit disabling. However, the config observer doesn't have access to the final config object, and therefore not to the `--enable-admission-plugins` field, so we'll use `--disable-admission-plugins` to indicate what needs disabling. We'll also need a special merge so that each plugin gets removed from the enabled list and added to the disabled one.
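For concreteness, a minimal sketch of that merge, assuming the arguments are reduced to a plain `map[string][]string` mirroring `KubeAPIServerConfig.apiServerArguments`; the function shape is illustrative, not the actual config observer API:

```go
package main

import "fmt"

// oauthAdmissionPlugins are the plugins to turn off when the internal
// OAuth stack is not in use.
var oauthAdmissionPlugins = []string{
	"authorization.openshift.io/RestrictSubjectBindings",
	"authorization.openshift.io/ValidateRoleBindingRestriction",
}

// disableOAuthAdmissionPlugins removes the OAuth-specific plugins from
// enable-admission-plugins and adds them to disable-admission-plugins.
func disableOAuthAdmissionPlugins(args map[string][]string) {
	toDisable := map[string]bool{}
	for _, p := range oauthAdmissionPlugins {
		toDisable[p] = true
	}

	// Drop the plugins from the enabled list.
	var enabled []string
	for _, p := range args["enable-admission-plugins"] {
		if !toDisable[p] {
			enabled = append(enabled, p)
		}
	}
	args["enable-admission-plugins"] = enabled

	// Add the plugins to the disabled list, avoiding duplicates.
	already := map[string]bool{}
	for _, p := range args["disable-admission-plugins"] {
		already[p] = true
	}
	for _, p := range oauthAdmissionPlugins {
		if !already[p] {
			args["disable-admission-plugins"] = append(args["disable-admission-plugins"], p)
		}
	}
}

func main() {
	args := map[string][]string{
		"enable-admission-plugins": {"authorization.openshift.io/RestrictSubjectBindings", "SomeOtherPlugin"},
	}
	disableOAuthAdmissionPlugins(args)
	fmt.Println(args["enable-admission-plugins"], args["disable-admission-plugins"])
}
```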
IIRC from my experimenting, overriding the `--enable-admission-plugins` in the config to no longer include these admission plugins did not sufficiently disable them, which is why I specifically call out adding them to the `--disable-admission-plugins` flag.

I'm not sure we need to go into the exact semantics of how this is achieved, but if we do I'm happy to do a bit more digging to figure out what changes may need to be made to the config logic.
Agreed, no need to go into more detail here; I added this as a note to ourselves, the result of some digging I did.
> This will mean vendoring the generated CRD manifests as outlined in https://github.com/openshift/api/tree/master?tab=readme-ov-file#vendoring-generated-manifests-into-other-repositories and adding a new controller to manage the CRD.
>
> Managing the CRD will consist of ensuring that the CRD is present on the cluster, and matches the desired manifest, when the authentication type is _not_ OIDC, and ensuring the CRD is not present on the cluster when the authentication type _is_ OIDC.
Suggested change (adding the rollout condition):

> Managing the CRD will consist of ensuring that the CRD is present on the cluster, and matches the desired manifest, when the authentication type is _not_ OIDC, and ensuring the CRD is not present on the cluster when the authentication type _is_ OIDC and OIDC configuration has been rolled out.
If we remove the CRD the moment the auth type becomes `OIDC`, we won't give admins time to react in case any RBRs exist, as the CRD (and therefore any existing resources) will be removed immediately. I believe we'll want this in two steps: the CAO complains if RBRs exist, and doesn't proceed with the OIDC rollout. Once they are deleted, the OIDC rollout proceeds. Once it has completed and OIDC is available (we'll use the new API field for that), OAuth cleanup starts, which includes deleting the CRD.

For the moment, this is the condition used to determine when OIDC has been enabled: https://github.com/openshift/cluster-authentication-operator/pull/740/files#diff-51c6cd196c758006bbe84eed012e6baac4713a856a96b7dfd10adc8ad7986e48R20

When we have the new API though, we'll use that to determine that it's available (i.e. `Available=True`). The KAS-o config observer will make sure to update the status accordingly when it detects that the KAS pods have been rolled out with OIDC.
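A rough sketch of that sequencing, with hypothetical inputs standing in for the RBR count and the new availability signal:

```go
package sketch

// nextOIDCStep sketches the two-step gating described above; the inputs and
// string outcomes are illustrative, not actual operator code.
func nextOIDCStep(rbrCount int, oidcAvailable bool) string {
	switch {
	case rbrCount > 0:
		// Step 1: the CAO reports the blockage and does not proceed.
		return "blocked: RoleBindingRestriction resources exist"
	case !oidcAvailable:
		// Step 2: RBRs are gone, the rollout proceeds until OIDC is available.
		return "rolling out OIDC"
	default:
		// Rollout complete: OAuth cleanup starts, including CRD removal.
		return "cleaning up OAuth resources, including the RBR CRD"
	}
}
```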
> The OIDC authentication mode on the cluster will not be allowed to be enabled if any `RoleBindingRestriction` resources exist.
>
> To communicate the reason for the enablement of the OIDC functionality being blocked, the `Authentication` API will be extended with a new status field to communicate the condition of the OIDC feature.
Let's discuss further how we'll communicate this; for example, we can set `Available=False`/`Degraded=True` when RBRs exist. We'll also need to take care of some corner cases, e.g. what if someone creates RBRs after the CAO has started the rollout, but before the KAS pods have restarted?
+1 to discussing further how we communicate this. I'll go into a bit more detail on this and then we can refine it from there.

For the corner case where an RBR is created after the CAO has already started the rollout process but before the KAS pods have restarted, my expectation is that we remove the CRD, which in turn deletes the CRs (in this case the newly created RBRs). We can discuss this further if we think this is an unacceptable user experience, but I think it would be OK for now. We could add warnings in the OpenShift documentation for enabling OIDC that any RBRs created during the rollout of the OIDC functionality will be automatically removed.
> my expectation is that we remove the CRD, which in turn deletes the CRs

I also think this sounds good enough for now 👍
/lgtm

Holding until the update commits are squashed.
/lgtm

/hold cancel
> - Disable the `authorization.openshift.io/RestrictSubjectBindings` and `authorization.openshift.io/ValidateRoleBindingRestriction` admission plugins
> - Remove the `rolebindingrestrictions.authorization.openshift.io` CustomResourceDefinition
> - Block OIDC enablement while any `RoleBindingRestriction` resources exist; this will be communicated in the `Authentication` resource via the `OIDCConfig` status field
if we did this, how would an HCP user be aware?
Does HCP already alert users on cluster operator degraded conditions? This status condition on the `Authentication` resource is in addition to the existing cluster-authentication-operator cluster operator conditions (I don't recall the exact resource type off the top of my head).
> ##### Changes to the cluster-kube-apiserver-operator
>
> When authentication type is set to OIDC, the `authorization.openshift.io/RestrictSubjectBindings` and `authorization.openshift.io/ValidateRoleBindingRestriction` admission plugins will be disabled.
Be specific. Is this spec or status? Can you describe what should happen in the transition case?
updated in c32abf3
> This will mean vendoring the generated CRD manifests as outlined in https://github.com/openshift/api/tree/master?tab=readme-ov-file#vendoring-generated-manifests-into-other-repositories and adding a new controller to manage the CRD.
>
> Managing the CRD will consist of ensuring that the CRD is present on the cluster, and matches the desired manifest, when the authentication type is _not_ OIDC, and ensuring the CRD is not present on the cluster when the authentication type _is_ OIDC and OIDC configuration has been successfully rolled out.
Thinking through cases, is it actually (condensed into a sketch below):

- internal oauth server is desired: ensure crd/rolebindingrestriction
- internal oauth server is configured and not desired and any rolebindingrestrictions currently exist: ensure crd/rolebindingrestriction. This supports a migration case, to avoid removing groups and users while the tokens are still honored.
- internal oauth server is configured and not desired and no rolebindingrestrictions currently exist: remove crd/rolebindingrestriction. This supports a migration case.
- internal oauth server is not configured and not desired: remove crd/rolebindingrestriction
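```go
package sketch

// shouldEnsureRBRCRD condenses the four cases above; a rough decision helper
// only — real controller inputs and logic will differ.
func shouldEnsureRBRCRD(oauthDesired, oauthConfigured bool, rbrCount int) bool {
	switch {
	case oauthDesired:
		// Internal oauth server desired: keep the CRD.
		return true
	case oauthConfigured && rbrCount > 0:
		// Migration in progress with RBRs still present: keep the CRD so
		// users/groups keep working while existing tokens are still honored.
		return true
	default:
		// Not desired and either not configured or no RBRs remain: remove it.
		return false
	}
}
```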
Yeah, that all sounds right. I'll update this section to be more clear on the different scenarios.
Updated in c32abf3
> Additionally, the CAO will be updated to block OIDC configuration on existence of `RoleBindingRestriction` resources. If `RoleBindingRestriction` resources are found, the Authentication CR's `OIDCConfig` status field will be updated to contain the following conditions:
>
> - Condition: `Progressing`, Status: `False`, Reason: `Blocked`, Message: `OIDC configuration blocked: RoleBindingRestriction resources found`
Can't force Progressing to false; it'll go Progressing for other reasons. Instead use Degraded.
This is for a non-cluster-operator-related resource (by which I mean that when the cluster operator sync fails, this resource shouldn't be updated). This API field doesn't currently exist, so nothing is setting it to Progressing. I'm fine with changing this to Degraded if you think that makes the most sense for this case, but we were trying to be cautious about setting Degraded unless the cluster is actually in a broken state.
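For illustration, the blocking condition with the `Degraded` type suggested above might be constructed roughly as follows; the condition type, reason, and message are placeholders, not final API:

```go
package sketch

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// rbrBlockedCondition sketches the Degraded-style condition discussed above;
// the condition type, reason, and message are illustrative only.
func rbrBlockedCondition(rbrCount int) metav1.Condition {
	return metav1.Condition{
		Type:               "Degraded",
		Status:             metav1.ConditionTrue,
		Reason:             "RoleBindingRestrictionsExist",
		Message:            fmt.Sprintf("OIDC configuration blocked: %d RoleBindingRestriction resources found", rbrCount),
		LastTransitionTime: metav1.Now(),
	}
}
```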
```go
// +patchStrategy=merge
// +listType=map
// +listMapKey=type
Conditions []metav1.Condition `json:"conditions"`
```
Rather than conditions, do we really want to know:

- required: is the internal oauth server active
- optional: is the external OIDC configured (what even knows this? nothing?)
We are trying to know when the new revisions of the kube-apiserver, in which it interacts directly with the external OIDC provider for authn decisions, have successfully rolled out.
We thought the conditions pattern would allow:
- cluster-authentication-operator to communicate to users progressing status of the oidc configuration
- cluster-kube-apiserver-operator to communicate to the cluster-authentication-operator and users when the apiserver rollout with this configuration was successful and ready to be used
- In the case of the cluster-authentication-operator noticing that the rollout was successful, it would begin to remove the oauth workloads and resources from the cluster
I don't think we care about the state of the oauth server unless something goes wrong tearing it down.
We can deduce whether external OIDC has been configured from the following:

- is there a revisioned OIDC structured auth configmap for each observed current revision of the KAS pods?
- is the respective structured auth KAS CLI arg enabled for each observed current revision of the KAS pods?
- are the respective OAuth KAS CLI args disabled for each observed current revision of the KAS pods?

If all of the above are true, we can deduce that external OIDC has been configured and rolled out. If at least one of the KAS pods is on a revision that does not include an OIDC-specific config, there is a rollout in progress that is either enabling or disabling OIDC. The KAS-o can monitor the rollout status of the KAS pods and update the `OIDCConfig.Conditions` accordingly (see the sketch below).
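A sketch of that deduction, with hypothetical fields standing in for the per-revision checks:

```go
package sketch

// kasRevision captures the per-revision facts listed above; the field names
// are hypothetical stand-ins for inspecting each observed KAS revision.
type kasRevision struct {
	hasOIDCAuthConfigMap bool // revisioned structured auth configmap exists
	structuredAuthArgSet bool // structured auth KAS CLI arg is enabled
	oauthArgsDisabled    bool // OAuth-specific KAS CLI args are disabled
}

// oidcRolledOut reports whether every current KAS revision satisfies all
// three checks, i.e. external OIDC is configured and fully rolled out.
func oidcRolledOut(revisions []kasRevision) bool {
	if len(revisions) == 0 {
		return false
	}
	for _, r := range revisions {
		if !r.hasOIDCAuthConfigMap || !r.structuredAuthArgSet || !r.oauthArgsDisabled {
			// At least one pod is on a non-OIDC revision: a rollout is in
			// progress, either enabling or disabling OIDC.
			return false
		}
	}
	return true
}
```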
Concept matches what I expected. Some questions on the conditionals, but I'm hopeful it falls out easily.
> - Logs in the KAS reflecting that informers for OAuth APIs could not be started
> - Logs in the KAS reflecting that the oauth-apiserver is unreachable
> - Errors, that are not actionable from the perspective of users, occurring during validation of `RoleBinding` resources when a `RoleBindingRestriction` resource in the same namespace specifies user/group restrictions
Perhaps clarify that this can prevent the creation of rolebindings in a namespace, which makes it impossible to manage permissions in that namespace.
> These impacts may cause concern for cluster administrators that their cluster is in an unhealthy state and may not be easily correlated to the enablement of the OIDC functionality.
>
> To ensure a holistic change in cluster behavior when enabling the OIDC functionality, and reduce potentially concerning behavior in the control plane components, it is proposed that changes are made to:
we want to do these things when the internal oauth-server is not used. OIDC enablement or not is immaterial, right?
This is because the user to group mapping used by the admission plugin won't be accurate if the internal oauth-server doesn't produce the users.
> we want to do these things when the internal oauth-server is not used. OIDC enablement or not is immaterial, right?

Correct. I'll update this.
> ##### Changes to the cluster-kube-apiserver-operator
>
> When the `Authentication` resource with name `cluster` has `.spec.type` set to `OIDC`, the `authorization.openshift.io/RestrictSubjectBindings` and `authorization.openshift.io/ValidateRoleBindingRestriction` admission plugins will be disabled.
Seems like the admission plugin should be disabled when `.spec.type` is set to something other than `IntegratedOAuth`, since the user/group mappings will be invalid.
Ah, this is something @liouk and I had an open question on and were going to ask in the arch call tomorrow. We weren't sure if setting `None` was equivalent to "I don't want the OpenShift oauth server" and should result in this teardown behavior, or if it was more of a "keep the oauth server up and configured, but don't manage it anymore".

I think it sounds reasonable to have this be the behavior when anything other than `IntegratedOAuth` is set.
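That gate might look roughly like the following; the `configv1` types exist in openshift/api, but the wiring and the treatment of the empty string as `IntegratedOAuth` are assumptions:

```go
package sketch

import (
	configv1 "github.com/openshift/api/config/v1"
)

// oauthAdmissionPluginsEnabled sketches the gate above: keep the RBR
// admission plugins only for IntegratedOAuth. Treating the empty string as
// IntegratedOAuth is an assumption based on the API's documented default.
func oauthAdmissionPluginsEnabled(auth *configv1.Authentication) bool {
	t := auth.Spec.Type
	return t == configv1.AuthenticationTypeIntegratedOAuth || t == ""
}
```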
> To support the need to remove the `rolebindingrestrictions.authorization.openshift.io` CustomResourceDefinition when the OpenShift oauth server is not desired, the cluster-authentication-operator will be updated to manage this CRD.
>
> This will mean vendoring the generated CRD manifests as outlined in https://github.com/openshift/api/tree/master?tab=readme-ov-file#vendoring-generated-manifests-into-other-repositories and adding a new controller to manage the CRD.
The static resource controller has the concept of conditional resources, with separate conditions for creation and deletion. Have a look to see if you can wire it up to do what you want.
That is what I had in mind - I'll make it a bit more explicit.
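For illustration, the conditional create/delete split could be modeled like this; the names are illustrative and deliberately do not claim to match the actual library-go static resource controller API:

```go
package sketch

// conditionalResource models the create/delete split that the static
// resource controller provides; names here are illustrative only.
type conditionalResource struct {
	manifest     []byte
	shouldCreate func() bool // e.g. internal oauth server is desired
	shouldDelete func() bool // e.g. OIDC rolled out and no RBRs remain
}

func (c conditionalResource) sync(apply, remove func([]byte) error) error {
	switch {
	case c.shouldCreate():
		// Ensure the CRD exists and matches the vendored manifest.
		return apply(c.manifest)
	case c.shouldDelete():
		return remove(c.manifest)
	default:
		// Neither condition holds: leave cluster state untouched.
		return nil
	}
}
```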
> ##### Changes to openshift/api
>
> As the cluster-authentication-operator will now be responsible for the `rolebindingrestrictions.authorization.openshift.io` CRD, it should no longer be added to the openshift/api payload manifests that are included in a payload image and get managed by CVO.
do we need to be sure this and the admission plugin are disabled in microshift?
I'm not familiar with microshift beyond having heard of it before. If it doesn't make use of the OpenShift oauth server then I would expect that the RBR CRD and associated admission plugins are disabled in microshift. I'll add another section to capture this.
/lgtm
> - `Group` informer creation and configuration is moved into the `authorization.openshift.io/RestrictSubjectBindings` admission plugin initialization process
> - Configuration options are added to the `authorization.openshift.io/RestrictSubjectBindings` admission plugin, allowing external components to specify the desired state of the OAuth stack.
>
> The configuration option proposed is as follows:
I'm sure I'm being dense but... why do we need this plugin configuration? KASO writes it but what reads it? It seems like the KAS configuration is still controlled by {enable|disable}-admission-plugins.
The plugin configuration is read by the kube-apiserver during startup and is passed to the admission plugin. We are still explicitly enabling/disabling the admission plugin(s) so that it is more easily viewable and understandable that we are intentionally disabling them.

Disabling an admission plugin solely with the `--disable-admission-plugins` flag doesn't seem to prevent initialization of the plugin, which, in this case, attempts to configure and start informers for an API that will not be present on the cluster. This isn't in and of itself a major issue, but it results in a lot of suspicious and noisy logs. Adding the plugin configuration piece allows us to prevent the problematic initialization, as the configuration step happens prior to the initialization step.
> Disabling an admission plugin solely with the `--disable-admission-plugins` flag doesn't seem to prevent initialization of the plugin

This seems like a bug with `--disable-admission-plugins`, no? If you disable it but some part of it still runs? But somehow the plugin configuration is fed in early enough to stop the whole plugin?

Said another way, if `--disable-admission-plugins` actually disabled the plugin (informers included), then we wouldn't need the plugin configuration?
I did some additional exploration on this and was unable to reproduce this behavior, which tells me this was a mistake I made during my initial testing. I've updated this PR to no longer include the addition of a configuration option for the admission plugin and to instead rely solely on the `--enable-admission-plugins` and `--disable-admission-plugins` flags.
@everettraven: all tests passed!
LGTM

/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: sjenning
/lgtm
Updates the original OIDC enhancement proposal to add some considerations for how we resolve an issue with the OpenShift default `authorization.openshift.io/RestrictSubjectBindings` admission plugin when enabling OIDC.