Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(oidc): add considerations for impacted kube-apiserver admission plugins #1726

Conversation

everettraven
Copy link
Contributor

@everettraven everettraven commented Dec 9, 2024

Updates the original OIDC enhancement proposal to add some considerations for how we resolve an issue with the OpenShift default authorization.openshift.io/RestrictSubjectBindings admission plugin when enabling OIDC.

Copy link
Contributor

openshift-ci bot commented Dec 9, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 9, 2024
@everettraven everettraven force-pushed the update/external-oidc-apiserver-impact branch from ba8e816 to 8246eb8 Compare December 18, 2024 21:01
@everettraven everettraven changed the title wip: add considerations for kube-apiserver admission plugins when ext… (oidc): add considerations for impacted kube-apiserver admission plugins Dec 18, 2024
@everettraven everettraven marked this pull request as ready for review December 18, 2024 21:02
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 18, 2024
This will be done through updates to the appropriate config observers to update the `KubeAPIServerConfig.apiServerArguments` map to:

- Remove the `authorization.openshift.io/RestrictSubjectBindings` and `authorization.openshift.io/ValidateRoleBindingRestriction` admission plugins from the `--enable-admission-plugins` argument
- Add the `authorization.openshift.io/RestrictSubjectBindings` and `authorization.openshift.io/ValidateRoleBindingRestriction` admission plugins to the `--disable-admission-plugins` argument
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clarity: AFAIU normally it would be enough to remove the plugins from the --enable-admission-plugins arg, as they are not default plugins that need explicit disabling. However, the config observer doesn't have access to the final config object and therefore the --enable-admission-plugins field, therefore we'll use the --disable-admission-plugins to indicate what needs disabling. We'll also need a special merge so that it gets removed from enabled and added to disabled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC from my experimenting, overriding the --enable-admission-plugins in the config to no longer include these admission plugins did not sufficiently disable them and is why I specifically call out adding them to the --disable-admission-plugins flag.

I'm not sure we need to into the exact semantics of how this achieved, but if we do I'm happy to do a bit more digging and figuring out what changes may need to be made to the config logic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, no need to go into more detail here; I just added this note as a result of some digging I did, as a note to ourselves.


This will mean vendoring the generated CRD manifests as outlined in https://github.com/openshift/api/tree/master?tab=readme-ov-file#vendoring-generated-manifests-into-other-repositories and adding a new controller to manage the CRD.

Managing the CRD will consist of ensuring that the CRD is present on the cluster, and matches the desired manifest, when the authentication type is _not_ OIDC, and ensuring the CRD is not present present on the cluster when the authentication type _is_ OIDC.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Managing the CRD will consist of ensuring that the CRD is present on the cluster, and matches the desired manifest, when the authentication type is _not_ OIDC, and ensuring the CRD is not present present on the cluster when the authentication type _is_ OIDC.
Managing the CRD will consist of ensuring that the CRD is present on the cluster, and matches the desired manifest, when the authentication type is _not_ OIDC, and ensuring the CRD is not present on the cluster when the authentication type _is_ OIDC and OIDC configuration has been rolled out.

If we remove the CRD the moment the auth type becomes OIDC, we won't give time to the admins to react in case any RBRs exist, as the CRD will be removed immediately (and therefore any existing resources). I believe we'll want this in two steps: CAO complains if RBRs exist, and doesn't proceed with OIDC rollout. Once they are deleted, OIDC rollout proceeds. Once it is completed and OIDC is available (we'll use the new API field for that), OAuth cleanup starts, which includes deleting the CRD.

For the moment, this is the condition used to determine when OIDC has been enabled: https://github.com/openshift/cluster-authentication-operator/pull/740/files#diff-51c6cd196c758006bbe84eed012e6baac4713a856a96b7dfd10adc8ad7986e48R20

When we'll have the new API though, we'll use that to determine that it's available (i.e. Available=True). The KAS-o config observer will make sure to update the status accordingly when it detects that the KAS pods have been rolled out with OIDC.


The OIDC authentication mode on the cluster will not be allowed to be enabled if any `RoleBindingRestriction` resources exist.

To communicate the reason for the enablement of the OIDC functionality being blocked, the `Authentication` API will be extended with a new status field to communicate the condition of the OIDC feature.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss further how we'll communicate this; for example, we can set Available=False/Degraded=True when RBRs exist. We'll need to also take care of some corner cases, e.g. what if someone creates RBRs after the CAO has started the rollout, but before the KAS pods have restarted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to discussing further how we communicate this. I'll go into a bit more detail on this and then we can refine it from there.

For the corner case where a RBR is created after the CAO has already started the rollout process but before the KAS pods have restarted, my expectation is that we remove the CRD, which in turn deletes the CRs (in this case the newly created RBRs). We can discuss this a bit further if we think that this is an unacceptable user experience, but I think this would be OK for now. We could add warnings in the OpenShift documentation for enabling OIDC that any RBRs created during the rollout of the OIDC functionality will be automatically removed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my expectation is that we remove the CRD, which in turn deletes the CRs

I also think this sounds good enough for now 👍

@everettraven everettraven requested a review from liouk January 8, 2025 14:21
@liouk
Copy link
Member

liouk commented Jan 10, 2025

/lgtm

Holding until update commits are squashed.
/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 10, 2025
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 10, 2025
@everettraven everettraven force-pushed the update/external-oidc-apiserver-impact branch from 6a23a52 to 611371b Compare January 10, 2025 13:33
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 10, 2025
@everettraven everettraven force-pushed the update/external-oidc-apiserver-impact branch from 611371b to 897ae74 Compare January 10, 2025 13:38
@liouk
Copy link
Member

liouk commented Jan 10, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 10, 2025
@liouk
Copy link
Member

liouk commented Jan 10, 2025

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 10, 2025

- Disable the `authorization.openshift.io/RestrictSubjectBindings` and `authorization.openshift.io/ValidateRoleBindingRestriction` admission plugins
- Remove the `rolebindingrestrictions.authorization.openshift.io` CustomResourceDefinition
- Block OIDC enablement while any `RoleBindingRestriction` resources exist; This will be communicated in the `Authentication` resource via the `OIDCConfig` status field
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we did this, how would an HCP user be aware?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does HCP already alert users on cluster operator degraded conditions? This status condition on the Authentication resource is in addition to the existing cluster-authentication-operator cluster operator conditions (don't recall the exact resource type off the top of my head)


##### Changes to the cluster-kube-apiserver-operator

When authentication type is set to OIDC, the `authorization.openshift.io/RestrictSubjectBindings` and `authorization.openshift.io/ValidateRoleBindingRestriction` admission plugins will be disabled.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be specific. Is this spec or status? Can you describe what should happen in the transition case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated in c32abf3


This will mean vendoring the generated CRD manifests as outlined in https://github.com/openshift/api/tree/master?tab=readme-ov-file#vendoring-generated-manifests-into-other-repositories and adding a new controller to manage the CRD.

Managing the CRD will consist of ensuring that the CRD is present on the cluster, and matches the desired manifest, when the authentication type is _not_ OIDC, and ensuring the CRD is not present present on the cluster when the authentication type _is_ OIDC and OIDC configuration has been successfully rolled out.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thinking through cases, is it actually

  1. internal oath server is desired, ensure crd/rolebindingrestriction
  2. internal oauth server is configured and not desired and any rolebindingrestrictions currently exist, ensure crd/rolebindingrestriction. This supports a migration case to avoid removing groups and users while the tokens are still honored
  3. internal oauth server is configured and not desired and no rolebindingrestrictions currently exist, remove crd/rolebindingrestriction. This supports a migration case.
  4. internal oauth server is not configuration and not desired, remove crd/rolebindingrestriction

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that all sounds right. I'll update this section to be more clear on the different scenarios.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in c32abf3

Additionally, the CAO will be updated to block OIDC configuration on existence of `RoleBindingRestriction` resources. If `RoleBindingRestriction` resources are found,
the Authentication CR's `OIDCConfig` status field will be updated to contain the following conditions:

- Condition: `Progressing`, Status: `False`, Reason: `Blocked`, Message: `OIDC configuration blocked: RoleBindingRestriction resources found`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't force progressing to false. it'll go progressing for other reasons. Instead use Degraded.

Copy link
Contributor Author

@everettraven everettraven Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for a non-cluster operator related resource (and by this I mean when the cluster operator sync fails this resource shouldn't be updated). This api field doesn't currently exist and therefore nothing is setting it to progressing. I'm fine with changing this to degraded if you think that makes the most sense for this case, but we were trying to be cautious of setting degraded unless the cluster is actually in a broken state.

// +patchStrategy=merge
// +listType=map
// +listMapKey=type
Conditions []metav1.Condition `json:"conditions"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather that conditions, do we really want to know

  1. required: is the internal oauth server active
  2. optional: is the external OIDC configured. (what even knows this. nothing?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are trying to know when the new revisions of the kube-apiserver have successfully rolled out where it is interacting directly with the external OIDC provider for authn decisions.

We thought the conditions pattern would allow:

  • cluster-authentication-operator to communicate to users progressing status of the oidc configuration
  • cluster-kube-apiserver-operator to communicate to the cluster-authentication-operator and users when the apiserver rollout with this configuration was successful and ready to be used
    • In the case of the cluster-authentication-operator noticing that the rollout was successful, it would begin to remove the oauth workloads and resources from the cluster

I don't think we care about the state of the oauth server unless something goes wrong tearing it down.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can deduce whether external OIDC has been configured by the following:

  • is there a revisioned OIDC structured auth configmap for each observed current revision of the KAS pods?
  • is the respective structured auth KAS CLI arg enabled for each observed current revision of the KAS pods?
  • are the respective OAuth KAS CLI args disabled for each observed current revision of the KAS pods?

If all the above are true, we can deduce that external OIDC has been configured and rolled out.

If at least one of the KAS pods is on a revision that does not include an OIDC specific config, there is a rollout in progress which is either enabling, or disabling OIDC.

The KAS-o can monitor the rollout status of the KAS pods and update the OIDCConfig.Conditions accordingly.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 13, 2025
Copy link
Contributor

@deads2k deads2k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept matches what I expected. Some questions on the conditionals, but I'm hopeful it falls out easily.


- Logs in the KAS reflecting that informers for OAuth APIs could not be started
- Logs in the KAS reflecting that the oauth-apiserver is unreachable
- Errors, that are not actionable from the perspective of users, occurring during validation of `RoleBinding` resources when a `RoleBindingRestriction` resource in the same namespace specifies user/group restrictions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps clarify that this can prevent the creation of rolebindings in namespaces, which makes it impossible to manage permissions in that namespace.


These impacts may cause concern for cluster administrators that their cluster is in an unhealthy state and may not be easily correlated to the enablement of the OIDC functionality.

To ensure a holistic change in cluster behavior when enabling the OIDC functionality, and reduce potentially concerning behavior in the control plane components, it is proposed that changes are made to:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we want to do these things when the internal oauth-server is not used. OIDC enablement or not is immaterial, right?

This is because the user to group mapping used by the admission plugin won't be accurate if the internal oauth-server doesn't produce the users.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we want to do these things when the internal oauth-server is not used. OIDC enablement or not is immaterial, right?

Correct. I'll update this.


##### Changes to the cluster-kube-apiserver-operator

When the `Authentication` resource with name `cluster` has `.spec.type` set to `OIDC`, the `authorization.openshift.io/RestrictSubjectBindings` and `authorization.openshift.io/ValidateRoleBindingRestriction` admission plugins will be disabled.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the admission plugin should be disabled when .spec.type is set to something other than IntegratedOAuth since the user/group mappings will be invalid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this is something @liouk and I had an open question on and were going to ask in the arch call tomorrow. We weren't sure if setting None was equivalent to "I don't want the OpenShift oauth server" and should result in this teardown behavior or if it was more of a "keep the oauth server up and configured, but don't manage it anymore".

I think it sounds reasonable to have this be the behavior when anything other than IntegratedOAuth is set.


To support the need to remove the `rolebindingrestrictions.authorization.openshift.io` CustomResourceDefinition when the OpenShift oauth server is not desired, the cluster-authentication-operator will be updated to manage this CRD.

This will mean vendoring the generated CRD manifests as outlined in https://github.com/openshift/api/tree/master?tab=readme-ov-file#vendoring-generated-manifests-into-other-repositories and adding a new controller to manage the CRD.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the static resource controller has the concept of conditional resources with separate conditions for creation and deletion. Have a look to see if you can wire it up to do what you want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is what I had in mind - I'll make it a bit more explicit.


##### Changes to openshift/api

As the cluster-authentication-operator will now be responsible for the `rolebindingrestrictions.authorization.openshift.io` CRD, it should no longer be added to the openshift/api payload manifests that are included in a payload image and get managed by CVO.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to be sure this and the admission plugin are disabled in microshift?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with microshift beyond having heard of it before. If it doesn't make use of the OpenShift oauth server then I would expect that the RBR CRD and associated admission plugins are disabled in microshift. I'll add another section to capture this.

@everettraven everettraven force-pushed the update/external-oidc-apiserver-impact branch 2 times, most recently from 7d34b35 to 367aa3f Compare January 28, 2025 14:18
@liouk
Copy link
Member

liouk commented Jan 28, 2025

/lgtm

- `Group` informer creation and configuration is moved into the `authorization.openshift.io/RestrictSubjectBindings` admission plugin initialization process
- Configuration options are added to the `authorization.openshift.io/RestrictSubjectBindings` admission plugin, allowing for external components to specify the desired state of the OAuth stack.

The configuration option proposed is as follows:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure I'm being dense but... why do we need this plugin configuration? KASO writes it but what reads it? It seems like the KAS configuration is still controlled by {enable|disable}-admission-plugins.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The plugin configuration is read by the kube-apiserver during startup and is passed to the admission plugin. We are still explicitly enabling/disabling the admission plugin(s) to be more easily viewable and understandable that we are intentionally disabling the plugins.

Disabling an admission plugin solely with the --disable-admission-plugins flag doesn't seem to prevent initialization of the plugin which, in this case, attempts to configure and start informers for an API that will not be present on the cluster. This isn't in and of itself a major issue, but results in a lot of suspicious and noisy logs. Adding the plugin configuration piece allows us to prevent the problematic initialization as the configuration step happens prior to the initialization step.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disabling an admission plugin solely with the --disable-admission-plugins flag doesn't seem to prevent initialization of the plugin

This seems like a bug with --disable-admission-plugins no? If you disable it but some part of it still runs? But somehow the plugin configuration is fed in early enough to stop the whole plugin?

Said another way, if --disable-admission-plugins actually disabled the plugin (informers included), then we wouldn't need the plugin configuration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some additional exploration on this and I was unable to reproduce this behavior which tells me this was a mistake I made during my initial testing of this. I've updated this PR to no longer include the addition of a configuration option for the admission plugin and to instead rely solely on the --enable-admission-plugins and --disable-admission-plugins flags.

@everettraven everettraven force-pushed the update/external-oidc-apiserver-impact branch from 367aa3f to 78a0079 Compare January 29, 2025 17:55
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 29, 2025
Copy link
Contributor

openshift-ci bot commented Jan 29, 2025

@everettraven: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@sjenning
Copy link
Contributor

LGTM

@sjenning
Copy link
Contributor

/approve

Copy link
Contributor

openshift-ci bot commented Jan 29, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sjenning

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 29, 2025
@liouk
Copy link
Member

liouk commented Jan 30, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 30, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit bcc5a12 into openshift:master Jan 30, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants