Retry import to the docker hub on 401 #8558
Conversation
Not complete yet, but mostly complete [test]
```go
// Get retrieves the identified by the digest, if it exists.
```
s/identified/Manifest \0/
Mostly nits, otherwise it LGTM. I hate that we're adding yet more docker-specific logic inside the code.
Force-pushed from b60d5ea to a659943
This is reworked from before - we retry immediately initially, then wait afterwards up to either a max time or max retries (2). Comments also reflect that this is about the docker hub JWT token being in the same second as the request and thus not valid (they don't backdate JWT tokens).
DockerHub generates JWT tokens that are valid from "now", which means fast actions can get a 401 unauthorized if the second boundary on the hub servers is not aligned. We retry 401 unauthorized requests immediately (one time), then wait a small window of time before retrying. If our operation is too slow (longer than the window) we do not retry. Probably an issue with Docker Trusted Registries as well.
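As an aside, a minimal sketch of the retry policy described above (all names and the exact window handling are illustrative, not the PR's actual code):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errUnauthorized stands in for the docker registry 401 error
// (errcode.ErrorCodeUnauthorized in the real client).
var errUnauthorized = errors.New("401 unauthorized")

type retrier struct {
	retries  int           // remaining retry budget (the PR uses 2)
	limit    time.Duration // window after the first failure in which retries are allowed
	wait     time.Duration // pause before the second and later retries
	deadline time.Time     // zero until the first retry, then now+limit
}

// shouldRetry reports whether the caller should repeat the request: only 401s
// are retried, the first one immediately, later ones after a short wait, and
// never once the window or the budget is exhausted.
func (r *retrier) shouldRetry(err error) bool {
	if err == nil || !errors.Is(err, errUnauthorized) {
		return false
	}
	if r.retries <= 0 {
		return false
	}
	r.retries--

	now := time.Now()
	if r.deadline.IsZero() {
		// First failure: the JWT may simply not be valid yet, so retry
		// immediately and start the window.
		r.deadline = now.Add(r.limit)
		return true
	}
	if now.After(r.deadline) {
		// The operation itself outlived the window; give up.
		return false
	}
	time.Sleep(r.wait)
	return true
}

func main() {
	r := &retrier{retries: 2, limit: 2 * time.Second, wait: 100 * time.Millisecond}
	fmt.Println(r.shouldRetry(errUnauthorized)) // true: immediate first retry
	fmt.Println(r.shouldRetry(errUnauthorized)) // true: retried after a wait, still within the window
	fmt.Println(r.shouldRetry(errUnauthorized)) // false: retry budget exhausted
}
```

The first 401 is retried with no delay on the assumption that the token becomes valid within the same second; later retries pause briefly and stop once the window closes.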
Force-pushed from a659943 to 7093846
Given that the root issue is a dockerhub flake, I think this should be post-1.2
Reported flake at docker/hub-feedback#647 with reproducible testcase at https://gist.github.com/liggitt/e7d06dc1cbc9d44034f762c5654c6fff
@liggitt rules!
Hrm, pull through still failed - because I don't have the retry logic in the registry.
We want to run at least one time at the full interval mark, which we were not doing before.
In pkg/image/importer/client_test.go:

```go
r := NewRetryRepository(nil, 1, 0).(*retryRepository)
// nil error doesn't consume retries
if r.shouldRetry(nil) {
	t.Fatal(r)
}
```
It's still unclear to me why you're doing Fatalf here instead of just Errorf?
The same applies to the next test case as well.
Because there's no reason to continue the test if the condition is violated.
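For background (an aside, not part of the thread): in Go's testing package, Fatal records the failure and stops the current test immediately via runtime.Goexit, while Error records it and lets the rest of the test run. A minimal, self-contained illustration:

```go
package example_test

import "testing"

func TestFatalVsError(t *testing.T) {
	t.Error("recorded as a failure; the test keeps running")
	t.Log("still executes after t.Error")
	t.Fatal("recorded as a failure; the test stops here")
	t.Log("never reached: Fatal calls runtime.Goexit")
}
```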
Makes sense; still, I personally prefer to run as many tests as possible. One violated condition doesn't mean all the others will be incorrect.
In this case it does - if you don't handle nil errors correctly, the logic of the rest won't work.
I see what you're talking about; the check after the initial failure doesn't make sense - agreed with that. But the additional shouldRetry checks still do. I'd generally convert all of these into table tests, skipping the additional checks if shouldRetry returned the wrong output, but at least going through all the possibilities. That was my intention raising this :)
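A sketch of the table-driven shape being suggested here; shouldRetry below is a simplified stand-in so the example compiles on its own, not the PR's (*retryRepository).shouldRetry, and the cases are illustrative:

```go
package importer_test

import (
	"errors"
	"testing"
)

// errUnauthorized and shouldRetry are stand-ins; in the PR they would be the
// docker errcode unauthorized check and the method on retryRepository.
var errUnauthorized = errors.New("401 unauthorized")

func shouldRetry(err error, retries int) bool {
	return err != nil && errors.Is(err, errUnauthorized) && retries > 0
}

func TestShouldRetry(t *testing.T) {
	testCases := []struct {
		name    string
		err     error
		retries int
		want    bool
	}{
		{"nil error doesn't consume retries", nil, 1, false},
		{"non-auth errors are not retried", errors.New("other"), 1, false},
		{"unauthorized with retries left", errUnauthorized, 1, true},
		{"unauthorized with no retries left", errUnauthorized, 0, false},
	}
	for _, tc := range testCases {
		if got := shouldRetry(tc.err, tc.retries); got != tc.want {
			// Errorf, not Fatal: a wrong result for one case shouldn't
			// stop the remaining cases from running.
			t.Errorf("%s: shouldRetry() = %v, want %v", tc.name, got, tc.want)
		}
	}
}
```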
Excellent - I can now consistently recreate the pull through flake.
So https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_integration/187/artifact/origin/artifacts/test-end-to-end-docker/logs/container-docker-registry-1-sirfi-registry.log just failed for pull through, with this error:
@soltysh could this be your context bug thing? I'm recreating this consistently - are we not writing the response correctly?
So the problem now with pull through is not an import failure - it's some other condition
Evaluated for origin test up to 9dd0f31
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/3113/)
Suspicious [merge]
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/5639/) (Image: devenv-rhel7_4001)
Evaluated for origin merge up to 9dd0f31
Crap, did not mean to merge this.
In pkg/image/importer/client.go:

```go
// shouldRetry returns false if the error is nil or is not an unauthorized
// error, if there are no retries left, or if we have already retried once and
// it has been longer than r.limit since we retried the first time.
func (r *retryRepository) shouldRetry(err error) bool {
	if err == nil {
		return false
	}
	if !isDockerError(err, errcode.ErrorCodeUnauthorized) {
		return false
	}
	if r.retries <= 0 {
```
don't know the calling code well enough to know if we need to take a lock here to avoid racing between goroutines
No concurrent calls (underlying docker client is not thread safe)
I've identified the following problems here, although it looks like the later ones are caused by the first one:
The auth problem bites us in two separate places, as I described in #8399 (comment):
With this PR you've addressed import, but the pullthrough logic is still affected.
@smarterclayton we'd have to add the retry logic as well in pullthroughblobstore.go's
We need to add the logic to the blob store wrapper in retry. Opened #8574
The DockerHub apparently can race / flake / return 401 on some responses
to retrieve public images (approx 1/200 imports?). Perform a simple
retry on 401 from the DockerHub up to a few times.
Still needs to be made more specific to DockerHub, and also needs a selective fix for the legacy pkg/dockerregistry client.
Fixes #7429 #8399 #8444