<!-- release-blog: covers 998813f3, dd879860, ba506cd6, 555423ad, 4f74a5a2, 39ce456d, 5ec276db -->
Docapybara runs on one Linode Nanode (1 GB of RAM, single host) behind Caddy. For the first few weeks, every deploy meant about 40 seconds of 502s while the new container booted. That was fine until people started noticing it. We rebuilt the deploy three times to get the user-visible downtime down to about five seconds — without spinning up Kubernetes.
Here's the arc: SIGHUP, then blue/green container rename, then Caddy dynamic upstream. Each step was correct for the situation it was deployed into, and each one stopped being correct when something else changed.
The setup
Four containers, one host, one docker-compose.prod.yml:
caddy → exposes 80/443, terminates TLS via Let's Encrypt, reverse-proxies to django
django → gunicorn on :8706, the actual web tier
worker → django-tasks queue worker
db → postgres:18
Caddy resolves the django upstream by Docker network alias. The atomic question for every deploy is: how do we move django from the old container to the new one without Caddy serving requests to a container that's halfway shut down?
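For orientation, here's roughly the shape of that compose file. This is a trimmed sketch, not our real config: image names, volumes, env vars, and the worker command are placeholders.
# docker-compose.prod.yml (abridged sketch; placeholder image names and volumes)
services:
  caddy:
    image: caddy:2
    ports: ["80:80", "443:443"]
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile
  django:
    image: docapybara/web:latest   # placeholder tag
    command: sh -c "python manage.py collectstatic --clear --noinput && exec gunicorn webapp.wsgi:application --bind 0.0.0.0:8706 --timeout 300 --workers 1 --threads 4"
  worker:
    image: docapybara/web:latest   # placeholder tag
    command: python manage.py db_worker   # placeholder; the post only says it's a django-tasks worker
  db:
    image: postgres:18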
Some constraints we accepted up front:
- One host. No load balancer in front of a fleet, no DNS round-robin between two VPSes.
- One worker per role. Nanode RAM doesn't fit two of everything.
- Caddy stays up. We never restart Caddy as part of a code deploy — TLS handshakes are not a thing we want disturbed.
What we wanted: zero-downtime for the common deploy path (push code, push image, swap containers).
First attempt: SIGHUP
Gunicorn responds to SIGHUP by reloading the Python application without dropping the TCP listener. Old workers finish their requests, new workers boot with the fresh code, and the container PID never changes. As long as you only changed Python source, this is the cheapest deploy in the world.
The trick is making sure SIGHUP actually reaches gunicorn. By default, command: sh -c "gunicorn ..." runs the shell as PID 1 and gunicorn as a child — which means signals get eaten by the shell. The fix is one keyword:
command: sh -c "python manage.py collectstatic --clear --noinput && exec gunicorn webapp.wsgi:application --bind 0.0.0.0:8706 --timeout 300 --workers 1 --threads 4"
exec replaces the shell with gunicorn, so gunicorn becomes PID 1 and docker compose kill -s SIGHUP django actually reloads it.
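Two commands worth knowing here, neither specific to our script: the first confirms the exec trick worked, the second is the whole code-only deploy.
# PID 1 inside the container should be gunicorn itself, not an sh wrapper
docker compose exec django cat /proc/1/comm
# With that in place, a code-only deploy (Python changes only) is just a reload
docker compose kill -s SIGHUP django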
This was great. Until we changed pyproject.toml. SIGHUP only reloads Python; if you've added a dependency or rebuilt the image, the running container still has the old packages. We needed something that could swap the image too.
Second attempt: blue/green container rename
The next pattern: bring up a second Django container alongside the running one, validate it without exposing it to live traffic, then atomically swap which container holds the django network alias.
The mechanic is docker rename:
# Bring up django-new with same image but a separate name
docker compose run -d --no-deps --name django-new django \
  sh -c 'exec gunicorn webapp.wsgi:application --bind 0.0.0.0:8706 ...'
# Validate it from inside the network (no public exposure)
docker compose exec caddy wget -qO- http://django-new:8706/
# Move the `django` alias from old to new
docker network disconnect repo_default django-new
docker network connect --alias django repo_default django-new
# Stop old, rename new into prod's slot
docker rm -f repo-django-1
docker rename django-new repo-django-1
This works. The new container boots, gets validated by Caddy from inside the project network, and the alias swap is atomic enough that traffic flips over almost cleanly.
The catch: Caddy was resolving the django upstream once at config load and caching it. So even after the alias swap, Caddy kept proxying to the old container's IP for as long as that resolution was cached. The DNS flap window — the gap between "old IP no longer responds" and "Caddy notices and re-resolves" — was variable, sometimes long enough to feel like a small outage.
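For contrast, the proxy line before the fix was the obvious static form (reconstructed here; the post doesn't show the old Caddyfile):
# Before: a static upstream, which Caddy resolved and then held on to, per the behaviour described above
reverse_proxy django:8706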
Third attempt: Caddy dynamic upstream
The fix is one Caddy directive:
handle {
    reverse_proxy {
        dynamic a {
            name django
            port 8706
            refresh 5s
        }
    }
}
dynamic a tells Caddy to re-resolve the upstream via Docker's embedded DNS every five seconds, instead of caching the first answer. Combined with the alias swap, the worst-case downtime collapses to whatever's between two refresh ticks — about five seconds, in practice.
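A useful way to see the mechanism: from inside the project network, the django name is answered by Docker's embedded DNS, so you can watch which container holds the alias around a swap. (Assumes the Alpine-based caddy image, whose busybox provides nslookup and wget.)
# Which IP does the `django` alias currently resolve to?
docker compose exec caddy nslookup django
# And does whatever holds it actually respond?
docker compose exec caddy wget -qO- http://django:8706/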
Together, the swap looks like this from a user's perspective:
- Old django container is serving requests. New container is invisible.
- New container boots, gets validated in-network.
- Alias swap. For a sub-second window, both containers hold the django alias. Docker DNS round-robins; both serve correctly (old has working code, new has the just-validated code).
- Old container removed. Only new responds to django lookups.
- Caddy's next refresh tick (≤5s away) picks up the new IP. Done.
No Caddy reload, no TLS renegotiation, no 502s in the common case.
Pre-flight gates
The other half of "zero-downtime" is "don't ship a broken image". The deploy runs two pre-flight checks against the new image before it ever joins the network:
# 1. Settings + URL conf load cleanly
docker compose run --rm --no-deps django python manage.py check --deploy
# 2. Migrations apply against prod DB
docker compose run --rm --no-deps django python manage.py migrate --noinput
If either fails, the script aborts before swapping anything. Old containers stay up, the site keeps serving, and the deploy log tells you what to fix.
This caught a real outage before it shipped. Earlier this month, a commit added python-frontmatter to pyproject.toml and the prod image hadn't been rebuilt — Django was boot-looping with ModuleNotFoundError. Without the pre-flight gate, the swap would have killed the working old container before noticing. With the gate, the new container failed check --deploy and the script aborted cleanly.
There's also a trap that removes django-new if the script dies between "started new" and "renamed it into prod's slot", so an interrupted deploy doesn't leave orphaned containers polluting the next run's DNS.
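A minimal sketch of that trap, with invented names (the real script's internals aren't shown in this post):
# Hypothetical cleanup: remove a half-born django-new if the deploy dies mid-swap
cleanup() {
  # After a successful rename, django-new no longer exists, so this is a no-op
  if docker ps -a --format '{{.Names}}' | grep -qx 'django-new'; then
    docker rm -f django-new
  fi
}
trap cleanup EXIT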
A 90-second post-swap verification
After the rename, the script polls https://docapybara.com/ for up to 90 seconds:
deadline=$((SECONDS + 90))
while [ $SECONDS -lt $deadline ]; do
  status=$(curl -sS -o /dev/null -w "%{http_code}" https://docapybara.com/)
  [ "$status" = "200" ] && break
  sleep 3
done
If it doesn't go green, the deploy exits non-zero and we know to look at logs. The 90 seconds covers gunicorn cold-start (the new container has just rebuilt the cache), TLS handshake variation, and any request-side latency from the first real request after the swap.
What we kept simple
No Kubernetes, no service mesh, no separate load balancer. The deploy is about 200 lines of Bash that anyone on the team can read end to end. Things we deliberately didn't build:
- No staging environment. Pre-flight gates do the work staging would have done. For a single-user product on a single host, a separate stage adds operational cost without much risk reduction.
- No multi-host failover. If the Linode is down, Docapybara is down. We accept this; the SLA is "best effort, single-user product, cheap enough to run that uptime is ours to babysit."
- No deploy history / rollback CLI. git revert and re-deploy is the rollback. Two minutes.
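Concretely, that looks something like this (the script name is a placeholder; the post only says the deploy is ~200 lines of Bash):
git revert <bad-sha>        # undo the offending commit
git push && ./deploy.sh     # placeholder script name; re-run the normal deploy path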
The deploy script ends with a per-phase timing summary so we can see where time goes:
=== Deploy phase timings ===
8s git pull
47s frontend rebuild + rsync
82s image build
3s pre-flight 1: check --deploy
9s migrations
31s django-new validation
2s django swap (alias + rm + rename)
5s worker recreate
18s post-swap verify
1s provisioning checks
────
206s total
Most of the time is the image build and the frontend rsync — both expected. The interesting numbers are the swap itself (2s) and the post-swap verify (18s), which together are the only window where any user might have seen something odd. In practice, the user-visible part is just Caddy's 5-second DNS refresh.
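The timing itself is nothing clever. Here's a sketch of how a phase timer like this can be kept in plain Bash, with invented names (the real script's helper isn't shown in the post):
# Hypothetical phase timer: record each phase's duration using the SECONDS builtin
declare -a PHASES DURATIONS
phase() {
  local name="$1"; shift
  local start=$SECONDS
  "$@"                              # run the phase's command
  PHASES+=("$name"); DURATIONS+=($((SECONDS - start)))
}

phase "git pull" git pull --ff-only
# ... more phases ...

echo "=== Deploy phase timings ==="
for i in "${!PHASES[@]}"; do
  printf '%4ss  %s\n' "${DURATIONS[$i]}" "${PHASES[$i]}"
done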
Where this stops working
Honest about the edges:
- Schema changes that break old code. A migration that drops a column will break the still-running old container the moment it applies. We split destructive migrations across two deploys, the standard Django pattern (sketched after this list; the Django docs on schema changes cover it well).
- Long-running requests. A 5-minute upload in flight when we swap will get cut. We don't have many of those, and --timeout 300 is generous; we'd need an explicit graceful drain to handle this without hitching.
- Caddyfile changes. Reloading Caddy to pick up a new directive is a separate path — the script does that before the django swap, so a bad Caddy config aborts before any container moves.
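And the two-deploy split for a dropped column, sketched with invented app, model, and field names:
# Deploy 1: ship code that no longer reads or writes the column,
#           but hold back the RemoveField migration. Old and new
#           containers can then coexist against the same schema.

# Deploy 2: ship only the migration that actually drops the column.
# documents/migrations/0042_remove_document_legacy_notes.py (illustrative)
from django.db import migrations

class Migration(migrations.Migration):
    dependencies = [("documents", "0041_some_earlier_migration")]
    operations = [
        migrations.RemoveField(model_name="document", name="legacy_notes"),
    ]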
If you're running anything more critical than a single-user product on a 1 GB VPS, you probably want a real load balancer in front of a real fleet. For our shape, this is good enough — and reading 200 lines of Bash beats debugging Helm charts at 11pm.
If you're doing similar work, the docapybara-mcp post has the rest of our infra story — how Claude Code, Claude Desktop, and Cursor talk to Docapybara via MCP. Same single-VPS shape; different problem.