Fix router reload errors for deleted certificates #18587

ramr · 2018-02-13T03:37:43Z

When a router is reloaded after a batch of route/ingress changes is committed, haproxy sometimes fails to reload. This can happen if a new request to delete a route (and so delete the associated certificates) is processed while the haproxy router is reloading. The router does recover on subsequent reloads, when the config changes are actually processed.
See associated error log:
https://gist.github.com/ramr/122d70591d1fb8f97820869f7ca5550f

The change here defers the deletes till commit time and ensures we only delete the certificates for the routes that are being processed as part of the current batchset.
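
As a rough illustration of the idea (not the actual patch), deferring deletes until commit time could look like the sketch below. The `memWriter` type, the `key` helper, and the `stageDelete` method are hypothetical names invented for this example; only `deletedCertificates`, `DeleteCertificate`, and `certificateFile` echo identifiers visible in the review hunks.

```go
package main

import "fmt"

// certificateFile identifies one certificate on disk. The field layout is an
// assumption based on the review hunks in this thread.
type certificateFile struct {
	certDir string
	id      string
}

// key builds the map key for a certificate (hypothetical format).
func key(certDir, id string) string { return certDir + "/" + id }

// memWriter is a hypothetical in-memory stand-in for the on-disk writer.
type memWriter struct{ files map[string]bool }

// DeleteCertificate removes the certificate from the in-memory "disk".
func (w *memWriter) DeleteCertificate(certDir, id string) error {
	delete(w.files, key(certDir, id))
	return nil
}

// certManager stages deletes instead of applying them immediately.
type certManager struct {
	w                   *memWriter
	deletedCertificates map[string]certificateFile
}

// stageDelete records a delete; nothing on disk changes yet.
func (cm *certManager) stageDelete(cf certificateFile) {
	cm.deletedCertificates[key(cf.certDir, cf.id)] = cf
}

// Commit applies the deletes staged for this batch, then resets the map so
// the next batch starts empty.
func (cm *certManager) Commit() error {
	for _, cf := range cm.deletedCertificates {
		if err := cm.w.DeleteCertificate(cf.certDir, cf.id); err != nil {
			return err
		}
	}
	cm.deletedCertificates = make(map[string]certificateFile)
	return nil
}

func main() {
	w := &memWriter{files: map[string]bool{key("certs", "route-a"): true}}
	cm := &certManager{w: w, deletedCertificates: map[string]certificateFile{}}

	cm.stageDelete(certificateFile{certDir: "certs", id: "route-a"})
	fmt.Println("present after stage: ", w.files[key("certs", "route-a")])
	if err := cm.Commit(); err != nil {
		fmt.Println("commit failed:", err)
	}
	fmt.Println("present after commit:", w.files[key("certs", "route-a")])
}
```

The point of the staging map is that a reload in progress never sees a certificate vanish underneath it; files are only removed as part of the batch that is being committed.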

To recreate:
Just create and delete a number of routes in a loop ala:

function create_routes() {
  local n=${1:-$NROUTES}
  for i in $(seq "$n"); do
    echo "  - Creating route header-test-route-$i ... "
    sed "s/%1/$i/g" route.json.template | oc create -f -
  done
}

function delete_routes() {
  local n=${1:-$NROUTES}
  for i in $(seq "$n"); do
    echo "  - Deleting route header-test-route-$i ... "
    oc delete route "header-test-route-$i"
    # Optionally pause after every 50 deletes:
    # [ $((i % 50)) -eq 0 ] && sleep 2
  done
}

where route.json.template is just a route with the metadata.name, id and spec.host fields containing some text (%1) that gets substituted.

@knobunc @rajatchopra PTAL Thx


openshift-ci-robot · 2018-02-13T03:37:51Z

Hi @ramr. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

pravisankar · 2018-02-13T03:54:11Z

/test

pravisankar · 2018-02-13T08:55:40Z

As part of router start, do we clean up stale certs (i.e. no route pointing to these certs)? If not, adding this will help us in these two cases which can cause stale certs:

- Cert deletion on disk failed
- We have accumulated certs to be deleted in next turn but before we processed these pending changes, router got terminated/killed.

@knobunc @rajatchopra @ramr

pravisankar

Added a couple of comments; the rest of the changes look good to me.

pravisankar · 2018-02-13T08:36:42Z

pkg/router/template/certmanager.go

+	// or commit the removals. Remove all the deleted certificates.
+	for _, certFile := range cm.deletedCertificates {
+		err := cm.w.DeleteCertificate(certFile.CertDir, certFile.ID)
+		if err != nil {


I think we should log an error and ignore these certs. This will be the same as what we do in cleanUpServiceAliasConfig() when DeleteCertificatesForConfig() fails.


Miciah · 2018-02-13T18:32:07Z

> We have accumulated certs to be deleted in next turn but before we processed these pending changes, router got terminated/killed.

If the router is terminated or killed, won't the kubelet restart the entire pod? And since the router has no persistent storage, won't accumulated certs be lost when the pod restarts?

pravisankar · 2018-02-13T18:47:06Z

Do we support a router pod backed by a PV/PVC? If not, then yes, we don't have this issue.


knobunc · 2018-02-13T19:34:06Z

Summarizing the conversation we had on the networking scrum. @pravisankar is absolutely correct that the router code has some hooks to support persistent storage. But it is not documented or set up by anything, and the code is untested. We assume it is a vestige of the original code and we intend to remove that since we don't think there is a good use for it. So... it is safe to ignore the case where the certs are backed by a PV in this PR.

knobunc · 2018-02-13T19:35:59Z

pkg/router/template/certmanager.go

+	// or commit the removals. Remove all the deleted certificates.
+	for _, certFile := range cm.deletedCertificates {
+		err := cm.w.DeleteCertificate(certFile.CertDir, certFile.ID)
+		if err != nil {


I agree. There's no harm to having the stray cert around if the delete fails. Let's log loudly and keep going.

knobunc · 2018-02-13T19:38:11Z

pkg/router/template/certmanager.go

+
+	cm.deletedCertificates = make(map[string]certificateFile, 0)
+
+	// If we decide to stage the certificate writes, we can flush the


The only advantage of staging the writes is, when a cert changes, making sure that the route version and the cert match. It should only make a functional difference if there are multiple certs for the same domain (e.g. for different paths?). Is there any case where it would actually matter? I can't think of one, so I think writing them out immediately is fine.

Ok - will add a log message if the delete fails (and keep going).
Yeah, writing the certificates out immediately is ok as those are processed via the crt directive which is only read at haproxy startup.

Miciah · 2018-02-13T19:56:03Z

pkg/router/template/certmanager.go

+
+// certificateFileTag generates a certificate file tag/name. This is used to
+// index into the map of deleted certificates.
+func (cf certificateFile) Tag() string {


The first word in the comment should match the function name (just Tag).

Good catch.

Miciah · 2018-02-13T19:56:52Z

pkg/router/template/certmanager.go

-				if err != nil {
-					return err
-				}
+


Extra blank line.

Miciah · 2018-02-13T20:00:07Z

pkg/router/template/certmanager.go

+type certificateFile struct {
+	CertDir  string
+	ID       string
+	Contents []byte


It doesn't look like anything uses the Contents field.

For now it's not used. I did have a comment re: staging the certificate writes as well, which is when we would have used it. But I think that's moot based on the call this morning, so I will remove the Contents field.

Miciah · 2018-02-13T20:02:00Z

pkg/router/template/certmanager.go

+type certificateFile struct {
+	CertDir  string
+	ID       string
+	Contents []byte


Is there any reason to make these fields public (upper-case initial letter)?

No particular reason (well, other than that I originally had it in types.go with an uppercase CertificateFile name). Will change to lower case.

ramr · 2018-02-13T22:42:03Z

addressed the review comments.
/cc @knobunc @pravisankar @Miciah

pravisankar · 2018-02-14T01:34:15Z

/lgtm

ramr · 2018-02-14T18:58:38Z

/retest

openshift-ci-robot · 2018-02-14T18:58:53Z

@ramr: you can't request testing unless you are an openshift member.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

pravisankar · 2018-02-14T19:07:03Z

/retest

rajatchopra · 2018-02-14T19:07:04Z

/retest

rajatchopra · 2018-02-14T20:28:52Z

/test gcp

ramr · 2018-02-15T05:19:36Z

Hmm, this looks like a test flake that is being hit consistently. All 4 of the errors are the same and occur around the same time. It looks like the tests create a service account and then a pod (with what seems to be a 10s wait), but the API token for the service account isn't available yet.
Flake issue filed: #18626

knobunc · 2018-02-15T17:16:14Z

/retest

ramr · 2018-02-21T19:01:26Z

@rajatchopra could you please merge this if all's good. thx

ramr · 2018-03-01T21:15:15Z

@knobunc is anything else needed here? bot says pr has no approved label. thanks.

knobunc · 2018-03-06T19:16:26Z

/approve
/lgtm

knobunc · 2018-03-06T19:16:38Z

/kind bug

knobunc

LGTM

openshift-ci-robot · 2018-03-06T19:16:52Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: knobunc, pravisankar, ramr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/router/OWNERS~~ [knobunc]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-merge-robot · 2018-03-06T20:50:05Z

/test all [submit-queue is verifying that this PR is safe to merge]

openshift-merge-robot · 2018-03-07T00:26:20Z

Automatic merge from submit-queue (batch tested with PRs 18587, 18296, 18667, 18665, 18532).

openshift-ci-robot · 2018-03-07T01:43:56Z

@ramr: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/openshift-jenkins/extended_conformance_install | `456bb4b` | link | `/test extended_conformance_install` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci-robot added the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) Feb 13, 2018

openshift-ci-robot requested review from pravisankar and rajatchopra February 13, 2018 03:37

openshift-ci-robot added the size/M label (denotes a PR that changes 30-99 lines, ignoring generated files) Feb 13, 2018

pravisankar removed the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) Feb 13, 2018

pravisankar added the component/routing label Feb 13, 2018

pravisankar reviewed Feb 13, 2018

knobunc requested changes Feb 13, 2018

Miciah reviewed Feb 13, 2018

Cleanup comments, log delete failed message as a warning and fix exported names as per @{pravisankar,Miciah,knobunc} review comments.

456bb4b

openshift-ci-robot requested review from knobunc, Miciah and pravisankar February 13, 2018 22:42

pravisankar approved these changes Feb 14, 2018

openshift-ci-robot assigned pravisankar Feb 14, 2018

openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) Feb 14, 2018

openshift-ci-robot assigned knobunc Mar 6, 2018

openshift-ci-robot added the kind/bug label (categorizes issue or PR as related to a bug) Mar 6, 2018

knobunc approved these changes Mar 6, 2018

openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Mar 6, 2018

openshift-merge-robot merged commit 0e4c94a into openshift:master Mar 7, 2018

