
proposal: archive/tar: support zero-copy reading/writing #70807

Open
hanwen opened this issue Dec 12, 2024 · 15 comments
@hanwen
Contributor

hanwen commented Dec 12, 2024

Proposal Details

The container ecosystem (podman, docker) spends its days creating and consuming huge .tar files. There is potential for significant speed-ups here by having the tar package use zero-copy file transfer.

The change is straightforward, but it involves an API change, so I am opening a proposal.

With the following change, tarring up a 2 GB file from tmpfs to tmpfs goes from 2.0s to 1.3s:

diff -u /home/hanwen/vc/go/src/archive/tar/writer.go hacktar/writer.go
--- /home/hanwen/vc/go/src/archive/tar/writer.go	2024-08-22 14:56:29.586690369 +0200
+++ hacktar/writer.go	2024-12-12 15:01:22.150045055 +0100
@@ -491,7 +491,7 @@
 //
 // TODO(dsnet): Re-export this when adding sparse file support.
 // See https://golang.org/issue/22735
-func (tw *Writer) readFrom(r io.Reader) (int64, error) {
+func (tw *Writer) ReadFrom(r io.Reader) (int64, error) {
 	if tw.err != nil {
 		return 0, tw.err
 	}
@@ -550,6 +550,15 @@
 }
 
 func (fw *regFileWriter) ReadFrom(r io.Reader) (int64, error) {
+	if _, ok := fw.w.(io.ReaderFrom); ok {
+		n, err := io.Copy(fw.w, r)
+		if n > fw.nb {
+			return n, fmt.Errorf("read %d bytes, beyond max %d", n, fw.nb)
+		}
+		fw.nb -= n
+		return n, err
+	}
+
 	return io.Copy(struct{ io.Writer }{fw}, r)
 }
 
@gopherbot gopherbot added this to the Proposal milestone Dec 12, 2024
@hanwen
Contributor Author

hanwen commented Dec 12, 2024

A similar optimization exists for the reading side, of course.

@hanwen hanwen changed the title proposal: archive/tar: support zero-copy writing proposal: archive/tar: support zero-copy reading/writing Dec 12, 2024
@ianlancetaylor
Member

Just to spell it out, I believe that the API change here is to define a new method on archive/tar.Writer:

// ReadFrom implements [io.ReaderFrom].
func (tw *Writer) ReadFrom(r io.Reader) (int64, error)

Note that I think you could get a similar effect without the API change by writing:

	if tw, ok := fw.w.(*Writer); ok {
		return tw.readFrom(r)
	}

CC @dsnet

@hanwen
Contributor Author

hanwen commented Dec 13, 2024

Your suggestion certainly improves regFileWriter.ReadFrom, but nothing calls it unless Writer.ReadFrom is exported. Am I missing something?

@hanwen
Contributor Author

hanwen commented Dec 13, 2024

For the reader side, this works:

 // TODO(dsnet): Re-export this when adding sparse file support.
 // See https://golang.org/issue/22735
-func (tr *Reader) writeTo(w io.Writer) (int64, error) {
+func (tr *Reader) WriteTo(w io.Writer) (int64, error) {
 	if tr.err != nil {
 		return 0, tr.err
 	}
@@ -688,6 +688,12 @@
 }
 
 func (fr *regFileReader) WriteTo(w io.Writer) (int64, error) {
+	_, ok1 := fr.r.(io.WriterTo)
+	wrf, ok2 := w.(io.ReaderFrom)
+	if ok1 && ok2 {
+		return wrf.ReadFrom(&io.LimitedReader{R: fr.r, N: fr.nb})
+	}
 	return io.Copy(w, struct{ io.Reader }{fr})
 }
 

@ianlancetaylor
Member

It's probably me that was missing something.

@mvdan
Member

mvdan commented Dec 14, 2024

cc @dsnet given the TODO above

@Jorropo
Member

Jorropo commented Dec 14, 2024

Do we want to include logic to pad out the tar so that content files align to the destination's blocksize when the writer's destination is an *os.File?
Performance improvements go from a single-digit multiple to several thousand-fold through reflink, at the cost of making the exact bytes of the tar depend on the output destination.

Tar natively pads only to 512 bytes :'(

blockSize = 512 // Size of each block in a tar stream

@hanwen
Contributor Author

hanwen commented Dec 15, 2024

@Jorropo - Fascinating insight, thanks! I can confirm that on btrfs, if I set blockSize to 4096, I can write a 2 GB tar file in 0.08s, which is amazing.

Unfortunately, the block size is fixed in the tar format, so this needs to be done a different way. Fortunately, one could simply add empty filler files to pad the archive out to 4096 bytes (or whatever the destination block size is). This can be done without changing the tar package at all.
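The padding arithmetic is simple: given the current archive offset, insert enough filler so that the next file's data region (which sits after its 512-byte header) lands on the target boundary. A sketch, where padTo is a hypothetical helper, not part of archive/tar:

```go
package main

import "fmt"

// padTo returns how many filler bytes are needed so that the next
// file's data region starts on an alignTo boundary. off is the current
// archive offset; headerSize is the 512-byte tar header that will
// precede the data. All names are illustrative.
func padTo(off, headerSize, alignTo int64) int64 {
	dataStart := off + headerSize
	if rem := dataStart % alignTo; rem != 0 {
		return alignTo - rem
	}
	return 0
}

func main() {
	fmt.Println(padTo(0, 512, 4096))    // 3584: filler before the first header
	fmt.Println(padTo(3584, 512, 4096)) // 0: data already lands on 4096
}
```

Since tar itself only rounds to 512-byte blocks, the filler must come in 512-byte units, which padTo's results satisfy whenever alignTo is a multiple of 512.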

@Jorropo
Member

Jorropo commented Dec 15, 2024

I am not suggesting we change that field; 512 is a hardcoded part of the tar format.
If we want to do this, we might be able to figure out a way to inject no-op entries that parsers will ignore, in order to bump the next 512-byte bucket to a 4 KiB (or whatever) aligned offset.

@hanwen-flow

If we want to do this

Having the generated bytes depend on a non-obvious property of the destination sounds confusing to me; I don't think we would want to do this. Regardless, it should go into its own proposal. Let's not side-track this discussion.

@ianlancetaylor ianlancetaylor moved this to Incoming in Proposals Dec 18, 2024
hanwen-flow added a commit to hanwen-flow/podman that referenced this issue Jan 9, 2025
The Go standard library supports copy_file_range for I/O between
files. For most file systems, this is a nice speed-up, because it avoids
copying data. For certain filesystems, like BTRFS and XFS, CoW makes
such writes metadata-only, speeding them up by 10x or so.

In principle, uncompressed tar files can be read and written using
copy_file_range, as the file format does not checksum the data (see
golang/go#70807). Currently, podman makes
this impossible, as the archive package liberally inserts buffers and
pipes.

This commit marks a couple of places that should be revisited.
@hanwen
Contributor Author

hanwen commented Jan 15, 2025

A change with some numbers is at https://go-review.googlesource.com/c/go/+/642736. For tmpfs and ext4, it yields a 10-20% speed improvement.

I have trouble measuring consistent results with BTRFS on Fedora 41. A standalone program can seemingly copy large files in no time at all (speeds of ~1000 GB/s), but this is not reflected in the benchmarks, even though strace shows copy_file_range being executed the same way in both.


@hanwen
Contributor Author

hanwen commented Jan 15, 2025

It looks like btrfs needs fsync called on the file for it to be eligible for metadata-only copies. Adding src.Sync() to the writeLargeFile function yields:

2025/01/15 17:39:27 wrote input0 (1310720000 byte) in 1.64345491s: 0.742766 gb/s
2025/01/15 17:39:27 copy input0 -> output0: 1310720000 bytes in 483.089µs: 2526.870049 gb/sec
2025/01/15 17:39:29 wrote input0 (1310720000 byte) in 1.873767098s: 0.651470 gb/s
2025/01/15 17:39:29 copy input0 -> output0: 1310720000 bytes in 310.081µs: 3936.723388 gb/sec
2025/01/15 17:39:30 copy input0 -> output-2-0: 1310720000 bytes in 137.397µs: 8884.496204 gb/sec

hanwen-flow added a commit to hanwen-flow/podman that referenced this issue Jan 16, 2025
This inserts dummy headers (empty name, char device 0,0) so files
start at a given blocksize. This will allow the tar package to execute
file copies in the kernel. For filesystems like XFS and BTRFS, this
allows for reflink copying, which effectively makes copying data free.

See golang/go#70807 for background.
hanwen pushed a commit to hanwen-flow/podman that referenced this issue Jan 18, 2025
This inserts dummy headers (empty name, char device 0,0) so files
within the tar archive start at Stat_t.Blksize boundaries. This is a
precondition for using reflink to share data blocks in BTRFS and XFS.

See golang/go#70807 for background.
hanwen-flow added a commit to hanwen-flow/podman that referenced this issue Jan 28, 2025
This inserts dummy headers (empty name, char device 0,0) so files
within the tar archive start at Stat_t.Blksize boundaries. This is a
precondition for using reflink to share data blocks in BTRFS and XFS.

See golang/go#70807 for background.