Zero-Downtime Hetzner Deploys with Ansible
Our hosting set up is really simple: a few cloud servers behind a load balancer. Our customers send us more than 1,000 requests/second, and it's important for deploys to have zero downtime. This is how we do it using Hetzner cloud servers, load balancers, and our Ansible script.
Of course, you can skip all of this and read the Ansible playbook yourself.
Deployment Steps
For each server, our deploy has three steps:
- Set the server's /healthcheck endpoint to "unhealthy"
- Wait for the loadbalancer to remove the server from the pool
- Gracefully shutdown and redeploy the server
What follows are code snippets for each of these steps.
Set the server's /healthcheck to unhealthy
At the start our Ansible deploy, we just upload a file called "unhealthy" to the server. If that file is present, then the server will return a 500 error on its healthcheck.
- name: Set server status to unhealthy
copy:
content: ""
dest: ./unhealthy
And the corresponding Go code. We're using the Fiber web framework.
func HealthCheck(c *fiber.Ctx) error {
_, err := os.Stat("./unhealthy")
if !os.IsNotExist(err) {
return fiber.ErrBadGateway
}
return c.SendString("ok")
}
Wait for the load balancer to stop sending traffic
The tricky part was #2 - there is not a straightforward Ansible module to wait for the load balancer to stop sending traffic to the, node being deployed. But we figured it out!
This makes use of two Hetzner API endpoints:
- Server Metadata. This endpoint runs on the server
and returns information about itself: ip address, hostname, etc. In our case, we just wanted the
instance-id
, which is Hetzner's unique identifier for a server. - Get Load Balancer. This returns
information about a load balancer, including each of its nodes and healthcheck status. We need the
instance-id
from step 1 to check the status in step 2.
To get the instance-id
, we run this Ansible stanza:
- name: Get Hetzner server ID
uri:
url: "http://169.254.169.254/hetzner/v1/metadata/instance-id"
return_content: true
register: server_id
Now, this is the magic to wait for the LB to mark our server as unhealthy. This is important because we want to make sure we aren't sending any traffic before we deploy.
- name: Wait for Hetzner to remove node from LB
uri:
url: "https://api.hetzner.cloud/v1/load_balancers/{{hetzner_lb_id}}"
return_content: true
headers:
Authorization: "Bearer {{hetzner_api_key}}"
register: lb_response
until: >-
(
lb_response.json.load_balancer.targets
| selectattr('server.id', '==', server_id.content | int)
).0.health_status.0.status == 'unhealthy'
retries: 5
delay: 5
I couldn't have gotten this without help from this StackOverflow answer. It is tricky, but here is how it works:
- Fetches the status of the load balancer and stores the JSON API response into
lb_response
- Uses Jinja's selectattr filter to parse out the status of the specific server being deployed.
- Checks to see if that status is unhealthy.
- If this doesn't happen after 5 retries, cancel the deploy
From here, we can safely shut down the server and proceed with our deploy.
Conclusion
We've been happy with Hetzner's servers, which are the best value options we can find on the market today. With a little bit of cleverness we're able to deploy our service, one node at a time, without any interruption in traffic.