Fix router reload errors for deleted certificates #18587
When a router is reloaded after a batch of route/ingress changes is committed, haproxy sometimes fails to reload. This can happen if a new request to delete a route (and so delete the associated certificates) is processed while the haproxy router is reloading. The router does recover on subsequent reloads, when the config changes are actually processed. The change here defers the deletes until commit time and ensures we only delete the certificates for the routes that are being processed as part of the current batch set.
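The deferred-delete idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `stagedCertManager`, `certFile`, and the in-memory `onDisk` map are hypothetical stand-ins for the router's certificate manager and the filesystem, and cancelling a staged delete when the same certificate is written again is an assumption.

```go
package main

import "fmt"

// certFile identifies a certificate (hypothetical; loosely mirrors the
// certificateFile type discussed in this PR).
type certFile struct {
	certDir string
	id      string
}

// tag is the key used to index staged deletes.
func (cf certFile) tag() string { return cf.certDir + "/" + cf.id + ".pem" }

// stagedCertManager is a hypothetical stand-in for the certificate manager.
type stagedCertManager struct {
	onDisk  map[string]bool     // simulated filesystem, keyed by tag
	deleted map[string]certFile // deletes staged until the next Commit
}

func newStagedCertManager() *stagedCertManager {
	return &stagedCertManager{
		onDisk:  map[string]bool{},
		deleted: map[string]certFile{},
	}
}

// Write stores the certificate immediately. Assumption: a fresh write
// cancels any delete staged for the same certificate.
func (m *stagedCertManager) Write(cf certFile) {
	m.onDisk[cf.tag()] = true
	delete(m.deleted, cf.tag())
}

// Delete only records the request. The file stays in place, so a reload
// that is already in progress still finds it.
func (m *stagedCertManager) Delete(cf certFile) {
	m.deleted[cf.tag()] = cf
}

// Commit removes the staged certificates and resets the staging map.
func (m *stagedCertManager) Commit() {
	for tag := range m.deleted {
		delete(m.onDisk, tag)
	}
	m.deleted = map[string]certFile{}
}

func main() {
	m := newStagedCertManager()
	cf := certFile{certDir: "certs", id: "route-a"}
	m.Write(cf)
	m.Delete(cf)
	fmt.Println("present before commit:", m.onDisk[cf.tag()])
	m.Commit()
	fmt.Println("present after commit:", m.onDisk[cf.tag()])
}
```

The point of the sketch is only the ordering: the removal happens at commit time, together with the rest of the batch, rather than at the moment the route delete is received.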
Hi @ramr. Thanks for your PR. I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test
As part of router start, do we clean up stale certs (i.e. no route pointing to these certs)? If not, adding this will help us in these two cases which can cause stale certs:
Added a couple of comments; the rest of the changes look good to me.
pkg/router/template/certmanager.go
Outdated
	// or commit the removals. Remove all the deleted certificates.
	for _, certFile := range cm.deletedCertificates {
		err := cm.w.DeleteCertificate(certFile.CertDir, certFile.ID)
		if err != nil {
I think we should log an error and ignore these certs. This would be the same as what we do in cleanUpServiceAliasConfig() when DeleteCertificatesForConfig() fails.
I agree. There's no harm to having the stray cert around if the delete fails. Let's log loudly and keep going.
Will do.
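A rough sketch of the "log loudly and keep going" behaviour agreed on in this thread is shown below. The `deleter` interface is modelled on the `DeleteCertificate(certDir, id)` call visible in the diff; `certRef`, `flakyDeleter`, and `commitDeletes` are hypothetical names introduced only for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// certRef names a certificate file (hypothetical type).
type certRef struct{ certDir, id string }

// deleter is a hypothetical interface modelled on the DeleteCertificate
// call shown in the diff.
type deleter interface {
	DeleteCertificate(certDir, id string) error
}

// flakyDeleter fails for one ID to simulate a delete error.
type flakyDeleter struct {
	failID  string
	removed []string
}

func (d *flakyDeleter) DeleteCertificate(certDir, id string) error {
	if id == d.failID {
		return errors.New("simulated delete failure")
	}
	d.removed = append(d.removed, id)
	return nil
}

// commitDeletes attempts every staged delete. A failure is logged and
// counted but does not stop the loop, so one stray certificate cannot
// block the rest of the commit. It returns a fresh, empty staging map.
func commitDeletes(w deleter, staged map[string]certRef) (map[string]certRef, int) {
	failures := 0
	for tag, ref := range staged {
		if err := w.DeleteCertificate(ref.certDir, ref.id); err != nil {
			log.Printf("error deleting certificate %s: %v (ignoring)", tag, err)
			failures++
		}
	}
	return map[string]certRef{}, failures
}

func main() {
	w := &flakyDeleter{failID: "b"}
	staged := map[string]certRef{
		"certs/a.pem": {"certs", "a"},
		"certs/b.pem": {"certs", "b"},
		"certs/c.pem": {"certs", "c"},
	}
	staged, failures := commitDeletes(w, staged)
	fmt.Println("removed:", len(w.removed), "failures:", failures, "still staged:", len(staged))
}
```

Dropping the failed entry from the staging map, rather than retrying it on the next commit, matches the reasoning above that a stray certificate left on disk is harmless.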
If the router is terminated or killed, won't the kubelet restart the entire pod? And since the router has no persistent storage, won't accumulated certs be lost when the pod restarts?
On Tue, Feb 13, 2018 at 10:32 AM, Miciah Dashiel Butler Masters wrote:
> > We have accumulated certs to be deleted in next turn but before we
> > processed these pending changes, router got terminated/killed.
>
> If the router is terminated or killed, won't the kubelet restart the
> entire pod? And since the router has no persistent storage, won't
> accumulated certs be lost when the pod restarts?

Do we support a router pod backed by a PV/PVC? If not, then yes, we don't have this issue.
Summarizing the conversation we had on the networking scrum. @pravisankar is absolutely correct that the router code has some hooks to support persistent storage. But it is not documented or set up by anything, and the code is untested. We assume it is a vestige of the original code and we intend to remove that since we don't think there is a good use for it. So... it is safe to ignore the case where the certs are backed by a PV in this PR.
pkg/router/template/certmanager.go
	cm.deletedCertificates = make(map[string]certificateFile, 0)

	// If we decide to stage the certificate writes, we can flush the
The only advantage is when a cert changes to make sure that a route version and a cert match. And it should only make a functional difference if there are multiple certs for the same domain (e.g. for different paths?). Is there any case where it would actually matter? I can't think of one... so I think writing them out immediately is fine.
Ok - will add a log message if the delete fails (and keep going).
Yeah, writing the certificates out immediately is ok as those are processed via the crt directive which is only read at haproxy startup.
pkg/router/template/certmanager.go
Outdated
// certificateFileTag generates a certificate file tag/name. This is used to
// index into the map of deleted certificates.
func (cf certificateFile) Tag() string {
The first word in the comment should match the function name (just Tag).
Good catch.
pkg/router/template/certmanager.go
Outdated
	if err != nil {
		return err
	}

Extra blank line.
pkg/router/template/certmanager.go
Outdated
type certificateFile struct {
	CertDir  string
	ID       string
	Contents []byte
It doesn't look like anything uses the Contents field.
For now it's not used. I did have a comment re: staging the certificate writes as well, which is when we would have used it. But I think that's moot based on the call this morning, so I will remove the Contents field.
pkg/router/template/certmanager.go
Outdated
type certificateFile struct {
	CertDir  string
	ID       string
	Contents []byte
Is there any reason to make these fields public (upper-case initial letter)?
No particular reason (well, other than that I originally had it in types.go with an uppercase CertificateFile name). Will change to lower case.
…rted names as per @{pravisankar,Miciah,knobunc} review comments.
addressed the review comments.
/lgtm
/retest
@ramr: you can't request testing unless you are a openshift member. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/retest
/test gcp
Hmm, a test flake seems to be hit consistently. All 4 of the errors are the same and occur around the same time. It looks like the tests create a service account and then a pod (after waiting about 10s, it seems), but the API token for the service account isn't available yet.
/retest
@rajatchopra could you please merge this if all's good. thx
@knobunc is anything else needed here? The bot says the PR has no approved label. Thanks.
/approve
/kind bug
LGTM
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: knobunc, pravisankar, ramr The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/test all [submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue (batch tested with PRs 18587, 18296, 18667, 18665, 18532).
@ramr: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
When a router is reloaded after a batch of route/ingress changes is committed, haproxy sometimes fails to reload. This can happen if a new request to delete a route (and so delete the associated certificates) is processed while the haproxy router is reloading. The router does recover on subsequent reloads, when the config changes are actually processed.
See associated error log:
https://gist.github.com/ramr/122d70591d1fb8f97820869f7ca5550f
The change here defers the deletes till commit time and ensures we only delete the certificates for the routes that are being processed as part of the current batchset.
To recreate:
Just create and delete a number of routes in a loop ala:
where route.json.template is just a route with the metadata.name, id and spec.host fields containing some text (%1) that gets substituted.
@knobunc @rajatchopra PTAL Thx