# Streaming a terabyte without buffering it

> How the bkpdb agent moves a Postgres dump from a running database into encrypted object storage in constant memory, and the traps that almost broke it along the way.

URL: https://bkpdb.com/blog/streaming-a-terabyte/
Date: 2026-05-18
Author: Rahul
Section: Engineering

---


When you back up a 12 GB database, every approach works. You can `pg_dump` to a tempfile, wait for gzip to finish, and then `aws s3 cp` the result. It leaves a copy on disk you have to remember to delete, but it finishes.

When you back up a 1.2 TB database, only one approach works.

The agent is a single Go binary that runs on your database host. It does not write the backup to disk at any point, and it does not buffer it in memory. Bytes leave Postgres, pass through zstd compression and an age encryption envelope, and land in your bucket as S3 multipart parts. The agent holds a small, fixed amount of state along the way.

This is a post about how that pipeline is shaped, and the traps you fall into the first time you try to run a long-lived streaming subprocess from inside a Go program.


## The naïve version, and why it broke

The first version of the agent did the obvious thing. It opened a tempfile, told `pg_dump` to write to it, waited for the process to exit, then streamed the file to S3.

This works until your test database stops being small. We ran it against a synthetic fixture sized like a real production database and the host ran out of disk halfway through. We doubled the staging volume. The next run finished, but the upload took almost as long as the dump, and the backup window was roughly twice the time the database itself spent producing data.

The obvious fix is to pipe `pg_dump` through `zstd` into `aws s3 cp -`, the way every shell script on the internet does. We tried that, and then we ran it against a fixture full of `bytea` columns containing already-compressed images, which is to say data that does not compress further. The pipeline produced about 800 GB of compressed output. The AWS CLI on that host buffered stdin to `/tmp` before starting the multipart upload. The host filled up, the dump aborted, and the partial upload sat in the bucket charging storage until we cleaned it up.

The lesson was not that shell pipelines are bad. The lesson was that "streaming" is a property of every stage of the pipeline at once. If any one stage decides to buffer, you do not have a streaming backup. You have a tempfile in a costume.

## What streaming means inside a single process

The pipeline has four stages: `pg_dump` produces bytes on its stdout, zstd compresses them, age encrypts them, and an S3 part uploader sends them to the bucket in 16 MB parts. In Go, every stage after `pg_dump` is an `io.Writer`. Each writer forwards bytes to the next writer's `Write`. No writer is allowed to retain unbounded state.

{{< figure caption="Fig. 1, the writer chain. Each `io.Writer` consumes what the previous stage produced." >}}
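<pre><code class="language-go">// Build the chain bottom-up: S3 first, then age on top of it, then zstd.
s3w := newS3PartWriter(ctx, bucket, key, 16*1024*1024)
agew, err := age.Encrypt(s3w, recipient)
if err != nil {
    return fmt.Errorf("age: %w", err)
}
zstdw, err := zstd.NewWriter(agew, zstd.WithEncoderLevel(zstd.SpeedDefault))
if err != nil {
    return fmt.Errorf("zstd: %w", err)
}

cmd := exec.CommandContext(ctx, "pg_dump",
    "--format=custom",
    "--no-owner", "--no-acl",
    "--dbname=db_prod_us_east",
)
cmd.Stdout = zstdw                     // pg_dump writes into the top of the chain
cmd.Stderr = newBoundedTail(64 * 1024) // see below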
</code></pre>
{{< /figure >}}

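That is the whole shape. Three writers stacked on top of each other, with `pg_dump` writing into the top of the stack through the operating system's stdout pipe. Because `zstdw` is not an `*os.File`, `os/exec` creates that pipe and runs a single goroutine that copies from it into the writer. We add no intermediate buffer of our own and no goroutine fanout. The thing pacing the pipeline is Postgres itself.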
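One detail the figure leaves implicit: compressing and encrypting writers hold a small amount of buffered state, so once the producer is done the chain has to be closed from the top down, or the tail of the stream never reaches the bucket. Here is a minimal, stdlib-only sketch of that ordering. It uses `compress/gzip` as a stand-in for zstd, and `memSink` and `closeChain` are hypothetical names for illustration, not the agent's code.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// memSink is a hypothetical stand-in for the S3 part writer.
type memSink struct {
	buf    bytes.Buffer
	closed bool
}

func (s *memSink) Write(p []byte) (int, error) { return s.buf.Write(p) }
func (s *memSink) Close() error                { s.closed = true; return nil }

// closeChain closes stages outermost first, so each stage flushes its
// trailer into the stage below before that stage is finalized.
// It keeps going after an error and returns the first one it saw.
func closeChain(stages ...io.Closer) error {
	var first error
	for _, c := range stages {
		if err := c.Close(); err != nil && first == nil {
			first = err
		}
	}
	return first
}

// roundTrip pushes payload through gzip into the sink, closes the chain,
// and decompresses what the sink received.
func roundTrip(payload []byte) ([]byte, error) {
	sink := &memSink{}
	zw := gzip.NewWriter(sink)
	if _, err := zw.Write(payload); err != nil {
		return nil, err
	}
	if err := closeChain(zw, sink); err != nil {
		return nil, err
	}
	zr, err := gzip.NewReader(&sink.buf)
	if err != nil {
		return nil, err
	}
	return io.ReadAll(zr)
}

func main() {
	out, err := roundTrip([]byte("hello, pipeline"))
	fmt.Println(string(out), err)
}
```

In the real chain the equivalent call would come after `cmd.Wait()` returns, in the order zstd, then age, then the S3 writer.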

{{< plate title="The pipeline, in situ" num="Fig. 2" >}}
   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
   │ pg_dump  │──▶│   zstd   │──▶│   age    │──▶│  S3 part │
   │  stdout  │   │ encoder  │   │ encrypt  │   │ uploader │
   └──────────┘   └──────────┘   └──────────┘   └──────────┘
        │              │              │               │
        ▼              ▼              ▼               ▼
    bytes out      compressed     encrypted      16 MB parts
                                                   to bucket
                  ◀──────── backpressure ────────
{{< /plate >}}

## Backpressure flows the wrong direction

The picture in your head when you first write a pipeline like this is that Postgres produces bytes, every downstream stage is faster than the one before it, and the bytes flow downhill.

The picture in reality is the opposite. When S3 is slow, the part uploader blocks inside `Write`. When the uploader blocks, age blocks. When age blocks, zstd blocks. When zstd blocks, the kernel pipe buffer for `pg_dump`'s stdout fills up (usually 64 KB on Linux). When that pipe fills, `pg_dump` blocks on its next `write(2)`. The whole chain self-paces against the slowest link.

This is the property you want. Nothing is buffered anywhere unbounded. If S3 has a bad minute, the dump runs slower for a minute. If S3 returns 500s for long enough to exhaust the retry budget, the dump fails and we retry from scratch. (There is no resumable mode yet. It is on the roadmap.)
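The blocking behaviour described above can be reproduced in miniature without Postgres or S3. In this sketch, `gatedWriter` is a hypothetical slow sink that accepts one write per token, and `passThrough` stands in for the unbuffered middle stages. It illustrates the mechanism and is not the agent's code.

```go
package main

import (
	"fmt"
	"io"
)

// gatedWriter is a hypothetical slow sink: each Write blocks until a token
// arrives on gate, the way an uploader blocks while S3 is slow.
type gatedWriter struct{ gate chan struct{} }

func (g *gatedWriter) Write(p []byte) (int, error) {
	<-g.gate
	return len(p), nil
}

// passThrough stands in for a middle stage. It holds no buffer, so a
// blocked downstream writer blocks it too.
type passThrough struct{ next io.Writer }

func (s passThrough) Write(p []byte) (int, error) { return s.next.Write(p) }

// completedWrites starts a producer that tries to push `chunks` writes
// through the chain while the sink accepts only `allowed` of them. It
// reports how many producer writes returned and whether the producer
// was left stalled.
func completedWrites(chunks, allowed int) (completed int, stalled bool) {
	sink := &gatedWriter{gate: make(chan struct{})}
	chain := passThrough{next: passThrough{next: sink}}
	done := make(chan struct{}) // one signal per returned producer write

	go func() {
		for i := 0; i < chunks; i++ {
			chain.Write([]byte("chunk")) // blocks until the sink accepts it
			done <- struct{}{}
		}
	}()

	for i := 0; i < allowed; i++ {
		sink.gate <- struct{}{} // the sink accepts one write
		<-done                  // the producer sees that write return
		completed++
	}
	select {
	case <-done: // would mean a write returned without the sink accepting it
		completed++
	default:
		stalled = completed < chunks
	}
	// Release the remaining writes so the producer goroutine can exit.
	for i := completed; i < chunks; i++ {
		sink.gate <- struct{}{}
		<-done
	}
	return completed, stalled
}

func main() {
	fmt.Println(completedWrites(5, 2)) // prints "2 true"
}
```

The producer never gets ahead of the sink by more than the write it is currently blocked in, which is the same shape as `pg_dump` parked in `write(2)` on a full pipe.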

{{< pullquote attrib="the property of a writer chain you need to internalise" >}}Streaming is a property of every stage of the pipeline at once. If any one stage decides to buffer, you do not have a streaming backup. You have a tempfile in a costume.{{< /pullquote >}}

The one place we had to be deliberate is the S3 stage. The default `s3manager.Uploader` from the AWS Go SDK uploads parts concurrently, which is good for throughput but turns the bottom of the pipeline back into a buffer of `PartSize * Concurrency` bytes in flight. We set concurrency to 4 and part size to 16 MB, which caps in-flight state at 64 MB and is a reasonable trade against round-trip latency to the bucket.
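To make the `PartSize * Concurrency` bound concrete, here is a sketch of a part writer that enforces it with a semaphore. It is an illustration under assumptions, not the SDK's implementation or the agent's: `uploadFunc` is a hypothetical stand-in for an `UploadPart` call, and `simulate` exists only to exercise the writer with a fake upload.

```go
package main

import (
	"fmt"
	"sync"
)

// uploadFunc is a hypothetical stand-in for one S3 UploadPart call.
type uploadFunc func(partNum int, data []byte) error

// partWriter cuts a stream into fixed-size parts with at most cap(slots) uploads running.
type partWriter struct {
	partSize int
	upload   uploadFunc
	slots    chan struct{} // semaphore: one token per in-flight part
	cur      []byte        // the part currently being filled
	partNum  int
	wg       sync.WaitGroup
	once     sync.Once
	err      error // first upload error; read only after wg.Wait
}

func newPartWriter(partSize, concurrency int, up uploadFunc) *partWriter {
	return &partWriter{partSize: partSize, upload: up,
		slots: make(chan struct{}, concurrency), cur: make([]byte, 0, partSize)}
}

func (w *partWriter) Write(p []byte) (int, error) {
	total := len(p)
	for len(p) > 0 {
		n := w.partSize - len(w.cur)
		if n > len(p) {
			n = len(p)
		}
		w.cur = append(w.cur, p[:n]...)
		p = p[n:]
		if len(w.cur) == w.partSize {
			w.flush()
		}
	}
	return total, nil
}

// flush waits for a free slot (this wait is the backpressure), then uploads the part.
func (w *partWriter) flush() {
	w.slots <- struct{}{}
	w.partNum++
	num, data := w.partNum, w.cur
	w.cur = make([]byte, 0, w.partSize)
	w.wg.Add(1)
	go func() {
		defer w.wg.Done()
		defer func() { <-w.slots }()
		if err := w.upload(num, data); err != nil {
			w.once.Do(func() { w.err = err })
		}
	}()
}

// Close uploads the final short part, waits, and returns the first upload error.
func (w *partWriter) Close() error {
	if len(w.cur) > 0 {
		w.flush()
	}
	w.wg.Wait()
	return w.err
}

// simulate reports parts uploaded, bytes uploaded, and the peak number in flight.
func simulate(total, partSize, concurrency int) (parts, sent, peak int, err error) {
	var mu sync.Mutex
	inFlight := 0
	w := newPartWriter(partSize, concurrency, func(num int, data []byte) error {
		mu.Lock()
		inFlight++
		if inFlight > peak {
			peak = inFlight
		}
		parts++
		sent += len(data)
		mu.Unlock()
		// A real upload would do its network I/O here.
		mu.Lock()
		inFlight--
		mu.Unlock()
		return nil
	})
	w.Write(make([]byte, total))
	err = w.Close()
	return
}

func main() {
	fmt.Println(simulate(10, 4, 2))
}
```

Two simplifications are worth naming. This sketch holds up to `concurrency` parts in flight plus the one being filled, and it only reports upload errors at `Close`; a production writer would also want to fail the next `Write` as soon as an upload has failed.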

## The stderr trap

Here is the bug we built defences against from day one, because a previous Go service we ran internally had already bitten us with it.

That earlier service ran a long-lived subprocess and captured its stderr into a `bytes.Buffer`. It worked fine for months. Then one night the subprocess went into a verbose retry loop, printed a one-line warning every few milliseconds for six hours, and the parent process climbed from ten megabytes of memory to almost two gigabytes before the OOM killer reaped it.

The code that did it is the code anyone writes the first time they call `exec.Cmd`.

{{< figure caption="Fig. 3, the line of code that cost us a server, once." >}}
<pre><code class="language-go">var stderr bytes.Buffer
cmd.Stderr = &stderr
</code></pre>
{{< /figure >}}

`pg_dump` is a polite tool. On most databases it prints a few kilobytes of stderr over a run, and `bytes.Buffer` is fine. On a database with thousands of small tables, with verbose mode accidentally enabled by an operator debugging something else, it can produce gigabytes. So we did not write that code this time.

{{< figure caption="Fig. 4, a bounded tail. Keeps the last 64 KB of stderr, drops everything older." >}}
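<pre><code class="language-go">// boundedTail is an io.Writer that keeps only the last max bytes written.
type boundedTail struct {
    buf []byte
    max int
}

func newBoundedTail(max int) *boundedTail {
    return &boundedTail{buf: make([]byte, 0, max), max: max}
}

func (b *boundedTail) Write(p []byte) (int, error) {
    if len(p) >= b.max {
        // The write alone fills the tail: keep only its last max bytes.
        b.buf = append(b.buf[:0], p[len(p)-b.max:]...)
        return len(p), nil
    }
    if len(b.buf)+len(p) > b.max {
        // Shift the surviving bytes to the front so the backing array is reused.
        drop := len(b.buf) + len(p) - b.max
        b.buf = b.buf[:copy(b.buf, b.buf[drop:])]
    }
    b.buf = append(b.buf, p...)
    return len(p), nil
}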
</code></pre>
{{< /figure >}}

We keep the last 64 KB of stderr and throw the rest away. If the process exits non-zero, that tail is what we attach to the failure record. In practice it is always enough to identify what went wrong.

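One thing worth knowing about subprocess stderr in Go: ignoring it is rarely what you want. If you leave `cmd.Stderr` as nil, `os/exec` connects the child's stderr to the null device, and the one message that would have explained a failure is gone. If you take `cmd.StderrPipe()` and never drain it, the pipe eventually fills and the child blocks on its next write to stderr. You have two real choices: read it and keep some, or read it and throw it away on purpose. Pick one.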
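When stderr arrives as a reader, for example from `cmd.StderrPipe()`, both choices come down to the same drain loop. A minimal sketch, with `drainTail` and `tailOf` as hypothetical helper names; `tailOf` feeds the reader one byte at a time only to exercise the trimming path.

```go
package main

import (
	"fmt"
	"io"
	"strings"
	"testing/iotest"
)

// drainTail reads r to EOF so the writer on the other end never blocks,
// and returns at most the last max bytes. With max <= 0 it discards
// everything it reads.
func drainTail(r io.Reader, max int) ([]byte, error) {
	if max <= 0 {
		_, err := io.Copy(io.Discard, r)
		return nil, err
	}
	tail := make([]byte, 0, max)
	chunk := make([]byte, 4096)
	for {
		n, err := r.Read(chunk)
		p := chunk[:n]
		if len(p) >= max {
			// This read alone fills the tail: keep only its last max bytes.
			tail = append(tail[:0], p[len(p)-max:]...)
		} else {
			if over := len(tail) + len(p) - max; over > 0 {
				// Drop the oldest bytes, reusing the same backing array.
				tail = tail[:copy(tail, tail[over:])]
			}
			tail = append(tail, p...)
		}
		if err == io.EOF {
			return tail, nil
		}
		if err != nil {
			return tail, err
		}
	}
}

// tailOf runs drainTail over s, delivered one byte per Read.
func tailOf(s string, max int) string {
	tail, _ := drainTail(iotest.OneByteReader(strings.NewReader(s)), max)
	return string(tail)
}

func main() {
	fmt.Println(tailOf("abcdefghij", 4)) // prints "ghij"
}
```

With `StderrPipe`, the drain has to run in its own goroutine and finish before `cmd.Wait` is called, because `Wait` closes the pipe. Assigning a writer to `cmd.Stderr`, as the agent does, avoids that ordering problem.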

## The two numbers that almost bit us

Two settings in this pipeline matter more than they look like they should.

The zstd encoder in `klauspost/compress` defaults to a window size that scales with the compression level. At level 11 the window is 128 MB per worker. We had been benchmarking on a laptop with 32 GB of RAM and quietly defaulting to four workers at level 7. On the agent host, which is meant to be cheap to run alongside your database, that would have reserved 256 MB before doing any work. We pinned the encoder to one worker at level 3, with a fixed 4 MB window. The marginal compression savings from larger windows on Postgres custom-format output are small enough not to be worth the memory footprint.

The S3 part size matters for a different reason. Multipart uploads have a hard ceiling of 10,000 parts per object, so the part size fixes the largest object you can upload. A 16 MB part size caps you at about 160 GB. A 64 MB part size caps you at about 640 GB. Nothing warns you when you are close; the upload just fails once it runs out of part numbers, hours into the dump. We size parts dynamically based on the size of the previous backup for that database, starting at 16 MB and stepping up. There is no science to the constants. There is only "do not hit 10,000."
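As a rough illustration of that sizing rule, the sketch below picks a part size from the previous backup's size. The doubling step and the 50% headroom are assumptions made for the example, not the agent's actual constants, and `choosePartSize` is a hypothetical name. The 10,000-part and 5 GiB-per-part limits are S3's.

```go
package main

import "fmt"

const (
	maxParts    = 10_000   // S3 limit on parts per multipart upload
	minPartSize = 16 << 20 // starting part size from the post (16 MiB)
	maxPartSize = 5 << 30  // S3 limit on a single part (5 GiB)
)

// choosePartSize doubles the part size until 10,000 parts can hold the
// previous backup plus 50% headroom. Both the doubling and the headroom
// are illustrative assumptions.
func choosePartSize(prevBackupBytes int64) int64 {
	need := prevBackupBytes + prevBackupBytes/2
	size := int64(minPartSize)
	for size*maxParts < need && size < maxPartSize {
		size *= 2
	}
	if size > maxPartSize {
		size = maxPartSize
	}
	return size
}

func main() {
	// A previous backup of 1.2 TB needs room for 1.8 TB across 10,000 parts.
	fmt.Println(choosePartSize(1_200_000_000_000) >> 20) // prints 256
}
```

Larger parts also raise the in-flight memory bound, since that bound is the part size times the upload concurrency, so the two settings have to be chosen together.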

## What we still do not handle well

The whole pipeline retries from the beginning. If your dump fails ten hours in, we start over. We do not yet have a way to checkpoint. There are designs for it, most of them built on the directory format, which `pg_dump` can write in parallel with `-j` and where each table is naturally its own chunk. It is on the roadmap.

We also have no good answer for databases where the dump is faster than the network can carry it. `pg_dump` paces against the upload, which means it holds a transaction open longer than it would in isolation, which means `AccessShareLock` on every table in the dump for the duration. On a busy primary that blocks `ALTER TABLE`. The workaround is to dump from a read replica, which is what we recommend.

Neither of these is a streaming problem in the strict sense. They are the next two problems you find once you have solved the streaming problem.

That is the whole thing. Four stages in a row, one bounded buffer for stderr, and a strict rule about not buffering. Most of the work was finding the places where we had accidentally broken the rule.

Stuck on something in your own pipeline, or want to compare notes? Drop into our Discord. The stderr bug is exactly the kind of thing we like to hear about.

