Retry import to the docker hub on 401 #8558
Conversation
Not complete yet, but mostly complete [test]
```go
// Get retrieves the identified by the digest, if it exists.
```
s/identified/Manifest \0/
Mostly nits, otherwise it LGTM. I hate that we're adding yet more docker-specific logic inside the code.
Force-pushed from b60d5ea to a659943
This is reworked from before - we retry immediately initially, then wait afterwards up to either a max time or max retries (2). Comments also reflect that this is about the docker hub JWT token being in the same second as the request and thus not valid (they don't backdate JWT tokens).
DockerHub generates JWT tokens that are valid from "now", which means fast actions can get a 401 unauthorized if the second boundary on the hub servers is not aligned. We retry 401 unauthorized requests immediately (one time), then wait a small window of time before retrying. If our operation is too slow (longer than the window) we do not retry. Probably an issue with Docker Trusted Registries as well.
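As an aside, a minimal sketch of the retry policy described above (all names and the exact window handling are illustrative, not the PR's actual code):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errUnauthorized stands in for the docker registry 401 error
// (errcode.ErrorCodeUnauthorized in the real client).
var errUnauthorized = errors.New("401 unauthorized")

type retrier struct {
	retries  int           // remaining retry budget (the PR uses 2)
	limit    time.Duration // window after the first failure in which retries are allowed
	wait     time.Duration // pause before the second and later retries
	deadline time.Time     // zero until the first retry, then now+limit
}

// shouldRetry reports whether the caller should repeat the request: only 401s
// are retried, the first one immediately, later ones after a short wait, and
// never once the window or the budget is exhausted.
func (r *retrier) shouldRetry(err error) bool {
	if err == nil || !errors.Is(err, errUnauthorized) {
		return false
	}
	if r.retries <= 0 {
		return false
	}
	r.retries--

	now := time.Now()
	if r.deadline.IsZero() {
		// First failure: the JWT may simply not be valid yet, so retry
		// immediately and start the window.
		r.deadline = now.Add(r.limit)
		return true
	}
	if now.After(r.deadline) {
		// The operation itself outlived the window; give up.
		return false
	}
	time.Sleep(r.wait)
	return true
}

func main() {
	r := &retrier{retries: 2, limit: 2 * time.Second, wait: 100 * time.Millisecond}
	fmt.Println(r.shouldRetry(errUnauthorized)) // true: immediate first retry
	fmt.Println(r.shouldRetry(errUnauthorized)) // true: retried after a wait, still within the window
	fmt.Println(r.shouldRetry(errUnauthorized)) // false: retry budget exhausted
}
```

The first 401 is retried with no delay on the assumption that the token becomes valid within the same second; later retries pause briefly and stop once the window closes.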
Force-pushed from a659943 to 7093846
Given that the root issue is a dockerhub flake, I think this should be post-1.2
Reported flake at docker/hub-feedback#647 with reproducible testcase at https://gist.github.com/liggitt/e7d06dc1cbc9d44034f762c5654c6fff
@liggitt rules!
Hrm, pull through still failed - because I don't have the retry logic in the registry.
We want to run at least one time at the full interval mark, which we were not doing before.
In pkg/image/importer/client_test.go:

```go
r := NewRetryRepository(nil, 1, 0).(*retryRepository)
// nil error doesn't consume retries
if r.shouldRetry(nil) {
	t.Fatal(r)
}
```
It's still unclear to me why you're doing Fatalf here instead of just Errorf?
The same applies to the next test case as well.
Because there's no reason to continue the test if the condition is violated.
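For background (an aside, not part of the thread): in Go's testing package, Fatal records the failure and stops the current test immediately via runtime.Goexit, while Error records it and lets the rest of the test run. A minimal, self-contained illustration:

```go
package example_test

import "testing"

func TestFatalVsError(t *testing.T) {
	t.Error("recorded as a failure; the test keeps running")
	t.Log("still executes after t.Error")
	t.Fatal("recorded as a failure; the test stops here")
	t.Log("never reached: Fatal calls runtime.Goexit")
}
```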
Makes sense; still, I personally prefer to run as many tests as possible. One violated condition doesn't mean all the others will be incorrect.
In this case it does - if you don't handle nil errors correctly, the logic of the rest won't work.
I see what you're talking about; the check after the initial failure doesn't make sense - agreed with that. But the additional shouldRetry checks still do. I'd generally convert all of these into table tests, skipping the additional checks if shouldRetry returned the wrong output, but at least going through all the possibilities. That was my intention raising this :)
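A sketch of the table-driven shape being suggested here; shouldRetry below is a simplified stand-in so the example compiles on its own, not the PR's (*retryRepository).shouldRetry, and the cases are illustrative:

```go
package importer_test

import (
	"errors"
	"testing"
)

// errUnauthorized and shouldRetry are stand-ins; in the PR they would be the
// docker errcode unauthorized check and the method on retryRepository.
var errUnauthorized = errors.New("401 unauthorized")

func shouldRetry(err error, retries int) bool {
	return err != nil && errors.Is(err, errUnauthorized) && retries > 0
}

func TestShouldRetry(t *testing.T) {
	testCases := []struct {
		name    string
		err     error
		retries int
		want    bool
	}{
		{"nil error doesn't consume retries", nil, 1, false},
		{"non-auth errors are not retried", errors.New("other"), 1, false},
		{"unauthorized with retries left", errUnauthorized, 1, true},
		{"unauthorized with no retries left", errUnauthorized, 0, false},
	}
	for _, tc := range testCases {
		if got := shouldRetry(tc.err, tc.retries); got != tc.want {
			// Errorf, not Fatal: a wrong result for one case shouldn't
			// stop the remaining cases from running.
			t.Errorf("%s: shouldRetry() = %v, want %v", tc.name, got, tc.want)
		}
	}
}
```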
Excellent - I can now consistently recreate the pull through flake.
So https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_integration/187/artifact/origin/artifacts/test-end-to-end-docker/logs/container-docker-registry-1-sirfi-registry.log just failed for pull through, with this error:
@soltysh could this be your context bug thing? I'm recreating this consistently - are we not writing the response correctly?
So the problem now with pull through is not an import failure - it's some other condition
Evaluated for origin test up to 9dd0f31
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/3113/)
Suspicious [merge]
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/5639/) (Image: devenv-rhel7_4001)
Evaluated for origin merge up to 9dd0f31
Crap, did not mean to merge this.
In pkg/image/importer/client.go:

```go
// shouldRetry returns false if the error is nil or is not an unauthorized
// error, if there are no retries left, or if we have already retried once and
// it has been longer than r.limit since we retried the first time.
func (r *retryRepository) shouldRetry(err error) bool {
	if err == nil {
		return false
	}
	if !isDockerError(err, errcode.ErrorCodeUnauthorized) {
		return false
	}
	if r.retries <= 0 {
```
don't know the calling code well enough to know if we need to take a lock here to avoid racing between goroutines
No concurrent calls (underlying docker client is not thread safe)
I've identified the following problems here, although it looks like the later ones are caused by the first one:
The auth problem bites us in two separate places, as I described in #8399 (comment):
With this PR you've addressed import, but the pullthrough logic is still affected.
@smarterclayton we'd have to add the retry logic as well in pullthroughblobstore.go's
We need to add the logic to the blob store wrapper in retry. Opened #8574
The DockerHub apparently can race / flake / return 401 on some responses
to retrieve public images (approx 1/200 imports?). Perform a simple
retry on 401 from the DockerHub up to a few times.
Still needs to be made more specific to DockerHub, and also needs a selective fix for the legacy pkg/dockerregistry client.
Fixes #7429 #8399 #8444