-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Image pruning fixes #16717
Image pruning fixes #16717
Conversation
/retest |
@@ -208,7 +210,7 @@ var _ Pruner = &pruner{} | |||
// | |||
// Also automatically remove any image layer that is no longer referenced by any | |||
// images. | |||
func NewPruner(options PrunerOptions) Pruner { | |||
func NewPruner(options PrunerOptions) (Pruner, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
godoc
pkg/image/prune/prune.go
Outdated
@@ -527,29 +534,31 @@ func addBuildStrategyImageReferencesToGraph(g graph.Graph, strategy buildapi.Bui | |||
_, id, err := imageapi.ParseImageStreamImageName(from.Name) | |||
if err != nil { | |||
glog.V(2).Infof("Error parsing ImageStreamImage name %q: %v - skipping", from.Name, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
skipping
- this is not true anymore
pkg/image/prune/prune.go
Outdated
ids := make(graph.NodeSet) | ||
|
||
for _, imageNode := range imageNodes { | ||
glog.V(4).Infof("Examining image %q", imageNode.Image.Name) | ||
|
||
if imageIsPrunable(g, imageNode, algorithm) { | ||
glog.V(4).Infof("Image %q is prunable", imageNode.Image.Name) | ||
prunable = append(prunable, imageNode) | ||
if _, ok := prunable[imageNode.Image.Name]; !ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When can this condition be false?
pkg/image/prune/prune.go
Outdated
streamName := getName(streamNode.ImageStream) | ||
|
||
if _, ok := prunableStreams[streamName]; !ok { | ||
prunableStreams[streamName] = streamNode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do this unconditionally
pkg/image/prune/prune.go
Outdated
} | ||
|
||
for streamName, streamNode := range prunableStreams { | ||
glog.V(4).Infof("Checking if ImageStream %s", streamName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking if what?
pkg/image/prune/prune.go
Outdated
errs = append(errs, pruneImageComponents(p.g, p.registryClient, p.registryURL, prunableComponents, layerLinkPruner)...) | ||
errs = append(errs, pruneBlobs(p.g, p.registryClient, p.registryURL, prunableComponents, blobPruner)...) | ||
errs = append(errs, pruneManifests(p.g, p.registryClient, p.registryURL, prunableImageNodes, manifestPruner)...) | ||
|
||
if len(errs) > 0 { | ||
// If we had any errors removing image references from image streams or deleting | ||
// If we had any errors removing image references from deleting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removing image references from deleting
?
pkg/image/prune/prune.go
Outdated
glog.V(4).Infof("Checking tag event %d with image %q", i, tagEvent.Image) | ||
|
||
if _, ok := imageNodes[tagEvent.Image]; !ok { | ||
glog.V(4).Infof("Tag event doesn't match deleted image - keeping") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New code will produce MUCH better logs, but I still don't like them :-) They are not greppable.
glog.V(4).Infof("Checking image stream tag %s:%s: event %d with image %q - keeping, doesn't match deleted image", streamName, tag, i, tagEvent.Image)
glog.V(4).Infof("Checking image stream tag %s:%s: event %d with image %q - keeping, within the allowed range of revisions", streamName, tag, i, tagEvent.Image)
glog.V(4).Infof("Checking image stream tag %s:%s: event %d with image %q - removing, matches deleted image", streamName, tag, i, tagEvent.Image)
750b3ef
to
f6baf00
Compare
Signed-off-by: Gladkov Alexey <[email protected]>
f6baf00
to
a3477d9
Compare
a3477d9
to
53f1fdb
Compare
// delete any images. This way, you can rerun prune and retry things that failed. | ||
// If we had any errors deleting layers, blobs, or manifest data from the registry, | ||
// stop here and don't delete any images. This way, you can rerun prune and retry | ||
// things that failed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to get smarter about this and still prune the images for which we successfully pruned all the layers/blobs/manifests.
} | ||
|
||
return updatedStream, err | ||
} | ||
|
||
func (p *describingImageStreamDeleter) NotifyImageStreamPrune(stream *imageapi.ImageStream, updatedTags []string, deletedTags []string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
godoc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bparees done
@legionus is this tied to a bug/issue? |
pkg/image/prune/prune.go
Outdated
} | ||
|
||
updatedStream, err := streamPruner.UpdateImageStream(stream) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should know if we actually changed the stream or not, can't we be smarter here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though, on the previous step we've selected all image stream that should be modified. If the stream is not changed here, it means someone else concurrently removes objects from the image stream.
pkg/image/prune/prune.go
Outdated
} | ||
imageID = id | ||
case "DockerImage": | ||
ref, err := imageapi.ParseDockerImageReference(from.Name) | ||
if err != nil { | ||
glog.V(2).Infof("Error parsing DockerImage name %q: %v - skipping", from.Name, err) | ||
return | ||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If someone manages to create build config with corrupted image reference, it will prevent the pruner from running. Is this really such an important error?
If it is, let's add some flag (e.g. --force
) marching over small errors, crippled references and dead men.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@miminar You can not create such strategy:
# buildconfigs "cakephp-ex" was not valid:
# * spec.strategy.sourceStrategy.from.name: Invalid value: "foo:bar:baz": not a valid Docker pull specification: invalid reference format
Also even if I add such a option, then it will need to specify this option constantly until broken strategy is removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the prior docker/distribution rebases made parsing of references stricter. I can easily imagine really old build strategy with a relaxed reference value. I can also imagine this parser becoming even stricter.
Also even if I add such a option, then it will need to specify this option constantly until broken strategy is removed.
If the error is printed, the admin will have the opportunity to act and tune/delete the object. Make sure to include namespace/name of the build config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@miminar Done
pkg/image/prune/prune.go
Outdated
func calculatePrunableImages( | ||
g graph.Graph, | ||
imageNodes []*imagegraph.ImageNode, | ||
algorithm pruneAlgorithm) (map[string]*imagegraph.ImageNode, graph.NodeSet) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
func calculatePrunableImages(
...
algorithm pruneAlgorithm,
) (map[string]*imagegraph.ImageNode, graph.NodeSet) {
prunable := make(map[string]*imagegraph.ImageNode)
makes it clearer for eyes where arguments stop and code starts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@miminar Done
pkg/image/prune/prune.go
Outdated
glog.V(4).Infof("Removing tag %q from status.tags of ImageStream %s", tag, getName(stream)) | ||
delete(stream.Status.Tags, tag) | ||
if _, ok := imageNodes[tagEvent.Image]; !ok { | ||
glog.V(4).Infof("Image stream tag %s:%s revision %d - keeping because image %q is not deleted", streamName, tag, i, tagEvent.Image) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please verify that the image indeed exists (it is contained in the full image group)? And remove it if not (and the tag event is old enough).
Deleted images that still exist in imagestreamtags polute the output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@miminar Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@miminar No, we can't do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we can't do it.
Can you elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@miminar What you suggested cannot be implemented in this place because if the user adds a new image, we'll delete as non-existent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@miminar I mean that we have to be tolerant to unknown images.
pkg/image/prune/prune.go
Outdated
@@ -75,7 +75,9 @@ type ImageStreamDeleter interface { | |||
GetImageStream(stream *imageapi.ImageStream) (*imageapi.ImageStream, error) | |||
// UpdateImageStream removes all references to the image from the image | |||
// stream's status.tags. The updated image stream is returned. | |||
UpdateImageStream(stream *imageapi.ImageStream, image *imageapi.Image, updatedTags []string) (*imageapi.ImageStream, error) | |||
UpdateImageStream(stream *imageapi.ImageStream) (*imageapi.ImageStream, error) | |||
// NotifyImageStreamPrune shows notification about image stream was updated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
about updated image stream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@miminar Done
test/extended/images/helper.go
Outdated
@@ -775,6 +775,13 @@ func (c *CleanUpContainer) Run() { | |||
} | |||
} | |||
|
|||
out, err := c.OC.SetNamespace(metav1.NamespaceDefault).AsAdmin(). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a commentary why is this necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And a TODO to replace this with removals of individual blobs used in the test case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@miminar Done
/unassign @fabianofranz |
3e6fea9
to
7ab0ea0
Compare
@bparees No. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
pkg/image/prune/prune.go
Outdated
if err != nil { | ||
// During prune someone removed imagestream | ||
if kerrapi.IsNotFound(err) { | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think it's worth of 4th info level at least.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@miminar Done
pkg/image/prune/prune.go
Outdated
streamNode.ImageStream = updatedStream | ||
} | ||
|
||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
paranoid mode on: another NotFound
could occur here as well :paranoid mode off
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@miminar Done
e2523a6
to
c43af16
Compare
Signed-off-by: Gladkov Alexey <[email protected]>
Signed-off-by: Gladkov Alexey <[email protected]>
c43af16
to
eeb7388
Compare
} | ||
imageID = id | ||
case "DockerImage": | ||
ref, err := imageapi.ParseDockerImageReference(from.Name) | ||
if err != nil { | ||
glog.V(2).Infof("Error parsing DockerImage name %q: %v - skipping", from.Name, err) | ||
return | ||
return fmt.Errorf("error parsing DockerImage name %q: %v", from.Name, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought I saw agreement to make this a warning, not fail the entire prune process? That would be my inclination.
Same for the ImageStreamImage check above.
If the reference is invalid, it clearly is not going to reference an image and keep us from pruning that image.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought I saw agreement to make this a warning, not fail the entire prune process? That would be my inclination.
@bparees Yes, but my answer was that you can't create such reference.
If the reference is invalid, it clearly is not going to reference an image and keep us from pruning that image.
You can not be sure that this Strategy is not going to reference an image. You just don't know. I believe that in the conditions when we are not sure we need to save user data. In the end the user will see the error and can decide to remove this build/buildconfig or fix reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bparees Yes, but my answer was that you can't create such reference.
I think relying on that guarantee is a mistake. Bugs in the buildconfig validation could allow the creation of such a reference and then pruning is broken for no good reason.
You can not be sure that this Strategy is not going to reference an image.
I don't care what it may do in the future. I know it's not currently referencing an image. What the user may update it to reference in the future when they fix their typo is irrelevant. It may have tried to reference an image, but it is not referencing an image (it has an invalid reference, so it's not referencing anything).
I believe that in the conditions when we are not sure we need to save user data.
What are we saving? Can you give me a scenario where a user manages to create an image that we'd prune here despite an invalid reference existing in the system, that would ultimately be a problem?
If your argument is "the user meant to reference something but they did it wrong so we have to save everything until we know what they meant to reference", well what if they meant to reference "ruby" and they typed it as "riby"? It's a "valid" reference, so it won't trip this error, but the "ruby" image is going to get cleaned up in that case too. There's only so much protection we can offer.
In the end the user will see the error and can decide to remove this build/buildconfig or fix reference.
The admin running this is not the same as the user who created the broken reference. So your solution requires the admin to either go edit the user's object themselves, or track down the user and discuss how to resolve it. Neither of those are ideal situations for an admin who's just trying to free up space (not to mention that this is frequently run as an automated job that may or may not have good monitoring in place).
In your referenced comment thread you said "done" but it's not obvious to me what that means. You didn't add the suggested "force" flag and you didn't make this a warning. (And to be clear, I prefer the latter).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think relying on that guarantee is a mistake. Bugs in the buildconfig validation could allow the creation of such a reference and then pruning is broken for no good reason.
@bparees I'll change it, but it's your decision.
In your referenced comment thread you said "done" but it's not obvious to me what that means
My done
was about:
If the error is printed, the admin will have the opportunity to act and tune/delete the object. Make sure to include namespace/name of the build config.
and I added the ability to see the name of the object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll change it, but it's your decision.
I asked a pretty direct question:
What are we saving? Can you give me a scenario where a user manages to create an image that we'd prune here despite an invalid reference existing in the system, that would ultimately be a problem?
You seem very concerned about something here, but I haven't grasped what it is. Can you help me understand the scenario you are envisioning where proceeding with pruning is a mistake and why it is a mistake?
I am trying to make an educated decision.... so far I am not seeing the risk in proceeding with pruning in this situation and i'd like you to help me be confident that I am not overlooking something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I asked a pretty direct question:
@bparees At first I did not understand correctly. Sorry, of course I will try to explain my point of view.
What are we saving?
Since this operation can not be canceled (we send requests to the registry to delete blobs/manifests), then we need to be very careful with the calculation of references. Otherwise, we will permanently delete objects that the user did not want to lose.
If we have references that we can not parse, it means that there is something wrong in the system (incomplete upgrade or something else). In this case, it is not the best time to perform actions that can not be rolled back.
Ignoring "broken" Build/BuildConfig will lead us to the image pruning if there are no other references in the cluster.
Can you give me a scenario where a user manages to create an image that we'd prune here despite an invalid reference existing in the system, that would ultimately be a problem?
Imagine the situation when in the new version of origin we updated the github.com/openshift/origin/pkg/image/reference
in such a way (it can be corner cases) that 10% of the old references stopped parsing. If you ignore the reference parsing error, you will lose 10% of the links in the graph. This will lead to a loss of actual data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Imagine the situation when in the new version of origin we updated the github.com/openshift/origin/pkg/image/reference in such a way (it can be corner cases) that 10% of the old references stopped parsing. If you ignore the reference parsing error, you will lose 10% of the links in the graph. This will lead to a loss of actual data.
Thanks, that is a compelling argument.
Do we at least have enough information to add the namespace containing the bad buildconfig reference? Right now i'm not convinced we're printing enough information to help an admin go find the bad reference and fix it. They essentially would have to dump all the buildconfigs in all the namespaces and then grep for the string. We should do more to help point them directly to where the problem is, in these cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we at least have enough information to add the namespace containing the bad buildconfig reference?
@bparees Right now, I output an error to the user and specifying <namespace>/<name>
of the Build/BuildConfig. Is this information enough ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, i hadn't seen that you were catching this error and wrapping it w/ additional information. I see it now.
@bparees Also, I should note that |
Hm. I suppose you're getting what you asked for if you do that but I agree it's likely to surprise people. |
@bparees Obviously, the user did not ask us to turn the cluster into a non-consistent state. From our documentation:
and
So we do not have to delete something that is still required by the system. |
but they did ask us to remove all the tags from the system, so long as the associated images are older than a certain age.
sorry, I can't follow this statement. What behavior are you proposing would be correct for "--keep-tag-revisions=0"? |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bparees, dmage, legionus The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing |
@bparees Let's consider the situation: We have two Images:
ImageStream
One running Pod that points to the Image Now let's run the command: We will make the graph and add the dependences:
So according to the graph, we can remove the Image Now we have a Build that points to This happened because we did not count the reference to the |
hm. It seems like for objects that have a reference to an imagestreamtag, we should be resolving what image that imagestreamtag currently points to, and counting that as a hard reference from the object(buildconfig in this case) to the image. Is there a reason we did not do that? |
@bparees The tag may change during pruning. Since Build points to the tag we must protect this tag from pruning. Otherwise, if the tag changes, we will remove it anyway. Imagine the situation that after the start of the pruning process, someone updated the tag to something that we want to remove.
Moreover, this process with a certain probability can damage the registry storage (if you use S3). I mean, if you're pushing data at the same time, there's a chance to get broken images. |
so do we protect it or not? you said As you say, we're never going to solve 100% of potential race conditions since we don't have a transactional lock on the graph. Someone could also update their buildconfig to point to a tag or image we already selected for pruning, in the middle of a pruning operation. I think (if only for my education) we need to have a level set conversation on what the pruning algorithm is and why. It seems like there is some inconsistency as to whether the primary concern is imagestreamtags, or images. |
btw you have the same problem with a DeploymentConfig that is not currently deployed, I would imagine. You have no hard reference to an image, just a reference from the DeploymentConfig to an ImageStreamTag. Which again makes me think we should be resolving the reference to an image and treating that as a strong reference between the image and the deploymentconfig(or buildconfig). If someone updates the reference it mid-pruning you're still not any worse off than you are today where you ignore the reference entirely. |
/retest |
1 similar comment
/retest |
@bparees No. You can see it here.
We can change the conditions and make it possible. As a crazy idea: Mover the pruning process to the master and block (not reject) write requests while the pruning is in progress. In this case, we guarantee the immutability of the state. Also, we can move blobs/manifests pruning to the registry (we already have hard pruning), rather than sending millions of delete requests, which are rather slow.
@bparees I did not know anything about image pruning process a weeks ago )))
Sure. I do not know of all the places where there can be such tag references, so I listed only builds and buildconfigs. But of course there are certainly more of them.
Sure. It's better to take into account these references than to ignore them completely. Honestly, I do not understand why the graph is used to prune images. There is enough map of images with a reference counter. This will work well only if we are sure that the state of the cluster can not change, but since we ignore this fact, we can not take it into account. |
/test all [submit-queue is verifying that this PR is safe to merge] |
@legionus: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Automatic merge from submit-queue (batch tested with PRs 16717, 16759, 16756). |
This commits allows to deploy the RDO registry with OpenShift 3.7. It currently uses the "rdo-test" branch because there is still one unmerged pull request upstream that hasn't merged yet upstream. Delta from OpenShift 3.5 that is interesting to us: - Significant improvements for registry and image pruning see [1][2][3][4] - docker-registry can now use reencrypt routes [5] - Metrics and logging were deployed by default in 3.5, this is no longer the case in 3.7, avoiding an unnecessary impact on performance. [6] - We're now deploying a persistent volume for the docker-registry service on the local filesystem. [1]: openshift/origin#13671 [2]: openshift/origin#16717 [3]: openshift/origin#16656 [4]: openshift/origin#17020 [5]: openshift/origin#14249 [6]: openshift/openshift-ansible@660bafe Change-Id: I5c364a1aab883b6af061051bf190ce857bf2e1f9
No description provided.