<!-- release-blog: covers 998813f3, dd879860, ba506cd6, 555423ad, 4f74a5a2, 39ce456d, 5ec276db -->

Docapybara runs on one Linode Nanode (1 GB of RAM, single host) behind Caddy. For the first few weeks, every deploy meant about 40 seconds of 502s while the new container booted. That was fine until people started noticing it. We rebuilt the deploy three times to get the user-visible downtime down to about five seconds — without spinning up Kubernetes.

Here's the arc: SIGHUP, then blue/green container rename, then Caddy dynamic upstream. Each step was correct for the situation it was deployed into, and each one stopped being correct when something else changed.

## The setup

Four containers, one host, one `docker-compose.prod.yml`:

```text
caddy   → exposes 80/443, terminates TLS via Let's Encrypt, reverse-proxies to django
django  → gunicorn on :8706, the actual web tier
worker  → django-tasks queue worker
db      → postgres:18
```

Caddy resolves the `django` upstream by Docker network alias. The question at the heart of every deploy: how do we move the `django` alias from the old container to the new one without Caddy serving requests to a container that's halfway through shutting down?

Some constraints we accepted up front:

- **One host.** No load balancer in front of a fleet, no DNS round-robin between two VPSes.
- **One worker per role.** Nanode RAM doesn't fit two of everything.
- **Caddy stays up.** We never restart Caddy as part of a code deploy — TLS handshakes are not a thing we want disturbed.

What we wanted: zero downtime for the common deploy path (push code, push image, swap containers).

## First attempt: SIGHUP

Gunicorn responds to `SIGHUP` by reloading the Python application without dropping the TCP listener. Old workers finish their requests, new workers boot with the fresh code, and the container PID never changes. As long as you only changed Python source, this is the cheapest deploy in the world.

The trick is making sure SIGHUP actually reaches gunicorn. By default, `command: sh -c "gunicorn ..."` runs the shell as PID 1 and gunicorn as a child, so Docker delivers the signal to the shell and the shell never forwards it to gunicorn. The fix is one keyword:

```yaml
command: sh -c "python manage.py collectstatic --clear --noinput && exec gunicorn webapp.wsgi:application --bind 0.0.0.0:8706 --timeout 300 --workers 1 --threads 4"
```

`exec` replaces the shell with gunicorn, so gunicorn becomes PID 1 and `docker compose kill -s SIGHUP django` actually reloads it.

This was great. Until we changed `pyproject.toml`. SIGHUP only reloads Python; if you've added a dependency or rebuilt the image, the running container still has the old packages. We needed something that could swap the image too.
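Our script branches on exactly this at the top of the deploy. The helper below is a hypothetical sketch (the function name and the watched paths are our illustration, not the shipped script): it decides whether the cheap SIGHUP path is safe based on which files changed.

```shell
# Hypothetical helper (names are ours, not the shipped script):
# SIGHUP is only safe when nothing outside Python source changed.
needs_rebuild() {
    for f in "$@"; do
        case "$f" in
            pyproject.toml|uv.lock|Dockerfile*|docker-compose.prod.yml)
                return 0 ;;  # dependency or image change: full swap
        esac
    done
    return 1  # Python-only change: SIGHUP reload is enough
}

# In the deploy script, roughly:
#   if needs_rebuild $(git diff --name-only ORIG_HEAD HEAD); then
#       ... blue/green swap (next section) ...
#   else
#       docker compose kill -s SIGHUP django
#   fi
```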

## Second attempt: blue/green container rename

The next pattern: bring up a second Django container alongside the running one, validate it without exposing it to live traffic, then atomically swap which container holds the `django` network alias.

The mechanic is `docker rename`:

```bash
# Bring up django-new with same image but a separate name
docker compose run -d --no-deps --name django-new django \
    sh -c 'exec gunicorn webapp.wsgi:application --bind 0.0.0.0:8706 ...'

# Validate it from inside the network (no public exposure)
docker compose exec caddy wget -qO- http://django-new:8706/

# Attach the `django` alias to the new container (the old one keeps it until removed)
docker network disconnect repo_default django-new
docker network connect --alias django repo_default django-new

# Stop old, rename new into prod's slot
docker rm -f repo-django-1
docker rename django-new repo-django-1
```

This works. The new container boots, gets validated by Caddy from inside the project network, and the alias swap is atomic enough that traffic flips over almost cleanly.

The catch: Caddy was resolving the `django` upstream once at config load and caching it. So even after the alias swap, Caddy kept proxying to the old container's IP for as long as that resolution was cached. The DNS flap window — the gap between "old IP no longer responds" and "Caddy notices and re-resolves" — was variable, sometimes long enough to feel like a small outage.

## Third attempt: Caddy dynamic upstream

The fix is one Caddy directive:

```
handle {
    reverse_proxy {
        dynamic a {
            name django
            port 8706
            refresh 5s
        }
    }
}
```

`dynamic a` tells Caddy to re-resolve the upstream via Docker's embedded DNS every five seconds, instead of caching the first answer. Combined with the alias swap, the worst-case downtime collapses to whatever's between two refresh ticks — about five seconds, in practice.

Together, the swap looks like this from a user's perspective:

1. Old `django` container is serving requests. New container is invisible.
2. New container boots, gets validated in-network.
3. Alias swap. For a sub-second window, both containers hold the `django` alias. Docker DNS round-robins; both serve correctly (old has working code, new has the just-validated code).
4. Old container removed. Only new responds to `django` lookups.
5. Caddy's next refresh tick (≤5s away) picks up the new IP. Done.

No Caddy reload, no TLS renegotiation, no 502s in the common case.
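If even the refresh-tick window matters, Caddy's load balancer can retry another resolved upstream instead of returning a 502 on a dead dial. We haven't needed this, so treat it as a sketch against Caddy's documented `lb_try_duration` / `lb_try_interval` options rather than our shipped config:

```
reverse_proxy {
    dynamic a {
        name django
        port 8706
        refresh 5s
    }
    # If a dial fails (e.g. the old IP just vanished), keep retrying
    # the available upstreams for up to 5s before giving up with a 502.
    lb_try_duration 5s
    lb_try_interval 250ms
}
```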

## Pre-flight gates

The other half of "zero-downtime" is "don't ship a broken image". The deploy runs two pre-flight checks against the new image *before* it ever joins the network:

```bash
# 1. Settings + URL conf load cleanly
docker compose run --rm --no-deps django python manage.py check --deploy

# 2. Migrations apply against prod DB
docker compose run --rm --no-deps django python manage.py migrate --noinput
```

If either fails, the script aborts before swapping anything. Old containers stay up, the site keeps serving, and the deploy log tells you what to fix.
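The wiring around those two commands is a few lines of shell. This is a sketch with a hypothetical `run_gate` helper (our name for illustration, not the script's); the gate commands themselves are the ones above:

```shell
# Hypothetical wrapper: run a gate, abort the deploy on failure.
run_gate() {
    desc=$1; shift
    if "$@"; then
        echo "gate ok: $desc"
    else
        echo "gate FAILED: $desc; aborting before any swap" >&2
        return 1
    fi
}

# In the deploy script, roughly:
#   run_gate "check --deploy" docker compose run --rm --no-deps django \
#       python manage.py check --deploy || exit 1
#   run_gate "migrate" docker compose run --rm --no-deps django \
#       python manage.py migrate --noinput || exit 1
```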

This caught a real outage in flight. Earlier this month, a commit added `python-frontmatter` to `pyproject.toml`, and the prod image hadn't been rebuilt — Django was boot-looping with `ModuleNotFoundError`. Without the pre-flight gate, the swap would have killed the working old container before noticing. With the gate, the new container fails `check --deploy` and the script aborts cleanly.

There's also a `trap` that removes `django-new` if the script dies between "started new" and "renamed it into prod's slot", so an interrupted deploy doesn't leave orphaned containers polluting the next run's DNS.
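The trap itself is small. A sketch, assuming the in-flight container is named `django-new` as in the swap above:

```shell
# Remove the half-deployed container if the script dies mid-swap.
cleanup() {
    # -f tolerates "already gone"; output is silenced because the
    # container may never have been created yet
    docker rm -f django-new >/dev/null 2>&1 || true
}
trap cleanup EXIT INT TERM

# ... start django-new, validate, swap the alias, docker rename ...

# After the rename succeeds, nothing is named django-new anymore,
# and a normal exit shouldn't fire the cleanup:
trap - EXIT INT TERM
```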

## A 90-second post-swap verification

After the rename, the script polls `https://docapybara.com/` for up to 90 seconds:

```bash
deadline=$((SECONDS + 90))
while [ "$SECONDS" -lt "$deadline" ]; do
    # `|| echo 000` keeps a transient connection failure from aborting
    # the loop when the script runs under `set -e`
    status=$(curl -sS --max-time 10 -o /dev/null -w "%{http_code}" \
        https://docapybara.com/ || echo 000)
    [ "$status" = "200" ] && break
    sleep 3
done
```

If it doesn't go green, the deploy exits non-zero and we know to look at logs. The 90 seconds covers gunicorn cold-start (the new container has just rebuilt the cache), TLS handshake variation, and any request-side latency from the first real request after the swap.

## What we kept simple

No Kubernetes, no service mesh, no separate load balancer. The deploy is about 200 lines of Bash that anyone on the team can read end to end. Things we deliberately didn't build:

- **No staging environment.** Pre-flight gates do the work staging would have done. For a single-user product on a single host, a separate stage adds operational cost without much risk reduction.
- **No multi-host failover.** If the Linode is down, Docapybara is down. We accept this; the SLA is "best effort, single-user product, cheap enough to run that uptime is ours to babysit."
- **No deploy history / rollback CLI.** `git revert` and re-deploy is the rollback. Two minutes.

The deploy script ends with a per-phase timing summary so we can see where time goes:

```text
=== Deploy phase timings ===
   8s  git pull
  47s  frontend rebuild + rsync
  82s  image build
   3s  pre-flight 1: check --deploy
   9s  migrations
  31s  django-new validation
   2s  django swap (alias + rm + rename)
   5s  worker recreate
  18s  post-swap verify
   1s  provisioning checks
  ────
 206s  total
```

Most of the time is the image build and the frontend rsync — both expected. The interesting number is the swap itself (`2s`) and the post-swap verify (`18s`), which together are the only window where any user might have seen something odd. In practice, it's the 5-second DNS refresh on Caddy's side.

## Where this stops working

Being honest about the edges:

- **Schema changes that break old code.** A migration that drops a column will break the still-running old container the moment it applies. We split destructive migrations across two deploys, the standard Django pattern. ([Django docs on schema changes](https://docs.djangoproject.com/en/5.1/topics/migrations/) cover this well.)
- **Long-running requests.** A 5-minute upload in flight when we swap will get cut. We don't have many of those, and `--timeout 300` is generous; we'd need explicit graceful drain to handle this without hitching.
- **Caddyfile changes.** Reloading Caddy to pick up a new directive is a separate path — the script does that *before* the django swap, so a bad Caddy config aborts before any container moves.

If you're running anything more critical than a single-user product on a 1 GB VPS, you probably want a real load balancer in front of a real fleet. For our shape, this is good enough — and reading 200 lines of Bash beats debugging Helm charts at 11pm.

If you're doing similar work, the [docapybara-mcp post](/guides/developers-builders/docapybara-mcp/) has the rest of our infra story — how Claude Code, Claude Desktop, and Cursor talk to Docapybara via MCP. Same single-VPS shape; different problem.