Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix router reload errors for deleted certificates #18587

Merged

Conversation

ramr
Copy link
Contributor

@ramr ramr commented Feb 13, 2018

When a router is reloaded after a batch of route/ingress changes are committed, haproxy sometimes fail to reload. This can happen if a new request to delete a route (and so delete the associated certificates) is processed when the haproxy router is reloading. The router does recover on subsequent reloads when the config changes are actually processed.
See associated error log:
https://gist.github.com/ramr/122d70591d1fb8f97820869f7ca5550f

The change here defers the deletes till commit time and ensures we only delete the certificates for the routes that are being processed as part of the current batchset.

To recreate:
Just create and delete a number of routes in a loop ala:

function create_routes() {
  local n=${1:-$NROUTES}
  for i in `seq $n`; do
    echo "  - Creating route header-test-route-$i ... "
    sed "s/%1/$i/g" route.json.template | oc create -f -
  done
}

function delete_routes() {
  local n=${1:-$NROUTES}
  for i in `seq $n`; do
    echo "  - Deleting route header-test-route-$i ... "
    oc delete route header-test-route-$i
    # [ $((i%50)) == 0 ] && sleep 2
  done
}

where route.json.template is just a route with the metadata.name, id and spec.host fields containing some text (%1) that gets substituted.

@knobunc @rajatchopra PTAL Thx

committed, haproxy sometimes fail to reload. This can happen if a new
request to delete a route (and so delete the associated certificates)
is processed when the haproxy router is reloading. The router does
recover on subsequent reloads when the config changes are actually
processed. The change here defers the deletes till commit time and
ensures we only delete the certificates for the routes that are being
processed as part of the current batchset.
@openshift-ci-robot openshift-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 13, 2018
@openshift-ci-robot
Copy link

Hi @ramr. Thanks for your PR.

I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 13, 2018
@pravisankar pravisankar removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 13, 2018
@pravisankar
Copy link

/test

@pravisankar
Copy link

pravisankar commented Feb 13, 2018

As part of router start, do we clean up stale certs (i.e. no route pointing to these certs)? If not, adding this will help us in these two cases which can cause stale certs:

  • Cert deletion on disk failed
  • We have accumulated certs to be deleted in next turn but before we processed these pending changes, router got terminated/killed.

@knobunc @rajatchopra @ramr

Copy link

@pravisankar pravisankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added couple of comments, rest of the changes looks good to me.

// or commit the removals. Remove all the deleted certificates.
for _, certFile := range cm.deletedCertificates {
err := cm.w.DeleteCertificate(certFile.CertDir, certFile.ID)
if err != nil {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should log an error and ignore these certs. This will be same as what we do in cleanUpServiceAliasConfig() when DeleteCertificatesForConfig() fails.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. There's no harm to having the stray cert around if the delete fails. Let's log loudly and keep going.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

@Miciah
Copy link
Contributor

Miciah commented Feb 13, 2018

We have accumulated certs to be deleted in next turn but before we processed these pending changes, router got terminated/killed.

If the router is terminated or killed, won't the kubelet restart the entire pod? And since the router has no persistent storage, won't accumulated certs be lost when the pod restarts?

@pravisankar
Copy link

pravisankar commented Feb 13, 2018 via email

@knobunc
Copy link
Contributor

knobunc commented Feb 13, 2018

Summarizing the conversation we had on the networking scrum. @pravisankar is absolutely correct that the router code has some hooks to support persistent storage. But it is not documented or set up by anything, and the code is untested. We assume it is a vestige of the original code and we intend to remove that since we don't think there is a good use for it. So... it is safe to ignore the case where the certs are backed by a PV in this PR.

// or commit the removals. Remove all the deleted certificates.
for _, certFile := range cm.deletedCertificates {
err := cm.w.DeleteCertificate(certFile.CertDir, certFile.ID)
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. There's no harm to having the stray cert around if the delete fails. Let's log loudly and keep going.


cm.deletedCertificates = make(map[string]certificateFile, 0)

// If we decide to stage the certificate writes, we can flush the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only advantage is when a cert changes to make sure that a route version and a cert match. And it should only make a functional difference if there are multiple certs for the same domain (e.g. for different paths?). Is there any case where it would actually matter? I can't think of one... so I think writing them out immediately is fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok - will add a log message if the delete fails (and keep going).
Yeah, writing the certificates out immediately is ok as those are processed via the crt directive which is only read at haproxy startup.


// certificateFileTag generates a certificate file tag/name. This is used to
// index into the map of deleted certificates.
func (cf certificateFile) Tag() string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first word in the comment should match the function name (just Tag).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

if err != nil {
return err
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra blank line.

type certificateFile struct {
CertDir string
ID string
Contents []byte
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like anything uses the Contents field.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now its not used. I did have a comment re: staging the certificate writes as well - which is when we would have used it. But I think that's moot based on the call this morning, so will remove the Contents field.

type certificateFile struct {
CertDir string
ID string
Contents []byte
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason to make these fields public (upper-case initial letter)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No particular reason (well other than did have it originally in types.go with an uppercase CertificateFile name) - will change to lower case.

…rted

names as per @{pravisankar,Miciah,knobunc} review comments.
@ramr
Copy link
Contributor Author

ramr commented Feb 13, 2018

addressed the review comments.
/cc @knobunc @pravisankar @Miciah

@pravisankar
Copy link

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 14, 2018
@ramr
Copy link
Contributor Author

ramr commented Feb 14, 2018

/retest

@openshift-ci-robot
Copy link

@ramr: you can't request testing unless you are a openshift member.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@pravisankar
Copy link

/retest

1 similar comment
@rajatchopra
Copy link
Contributor

/retest

@rajatchopra
Copy link
Contributor

/test gcp

@ramr
Copy link
Contributor Author

ramr commented Feb 15, 2018

hmm, is a test flake being hit consistently. All 4 of the errors are the same and around the same time - looks like the tests create a service account and a pod after that (seems to wait for 10s) but the API token for the service account isn't available as yet.
Flake issue filed: #18626

@knobunc
Copy link
Contributor

knobunc commented Feb 15, 2018

/retest

@ramr
Copy link
Contributor Author

ramr commented Feb 21, 2018

@rajatchopra could you please merge this if all's good. thx

@ramr
Copy link
Contributor Author

ramr commented Mar 1, 2018

@knobunc is anything else needed here? bot says pr has no approved label. thanks.

@knobunc
Copy link
Contributor

knobunc commented Mar 6, 2018

/approve
/lgtm

@knobunc
Copy link
Contributor

knobunc commented Mar 6, 2018

/kind bug

@openshift-ci-robot openshift-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 6, 2018
Copy link
Contributor

@knobunc knobunc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: knobunc, pravisankar, ramr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 6, 2018
@openshift-merge-robot
Copy link
Contributor

/test all [submit-queue is verifying that this PR is safe to merge]

@openshift-merge-robot
Copy link
Contributor

Automatic merge from submit-queue (batch tested with PRs 18587, 18296, 18667, 18665, 18532).

@openshift-merge-robot openshift-merge-robot merged commit 0e4c94a into openshift:master Mar 7, 2018
@openshift-ci-robot
Copy link

@ramr: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/openshift-jenkins/extended_conformance_install 456bb4b link /test extended_conformance_install

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. component/routing kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants