Migrating xoxo.zone to OVHcloud
Mastodon hosted on xoxo.zone is now living on its new server at OVHcloud. We were hosted at Hetzner for just about 2 years, but draconian terms of service and some scary experiences for other communities made the move inevitable. I evaluated a few EU-based hosts, including Netcup and Scaleway. OVH hit the sweet spot of pricing, specs, and terms of use.
We built this shitty
Due to a miscalculation, the server I provisioned on Hetzner used spinning rust HDDs instead of SSDs. This gave us a ton of storage headroom we didn't need, and was also slow as fuck and made many simple things very painful. Even git operations and restarting services could take minutes instead of seconds.
The upgrade to Mastodon v4.2.0 last October was particularly painful. I first upgraded the server from Ubuntu 20.04 to Ubuntu 22.04, and planned to keep going to Ubuntu 24.04. This required a PostgreSQL upgrade from v15 to v17. I started this upgrade at 15:00, and gave up for the day when the database finally finished rebuilding at 01:30. The server was down the whole time. Bummer.
A lot of effort went into making this look effortless
A challenge of running a server like this is needing to know a little bit about everything. It was clear to me that I could do better than 10.5 hours of downtime, but I wasn't sure how to get there. I've done a lot of reading about PostgreSQL migration strategies since October.
The server database is backed up twice a day. The pg_dump takes about 2 hours, and the upload is another half hour, to say nothing of restoring the db on a new host. Not awesome.
A test rsync --checksum of the database took about 80 minutes, even for subsequent rsyncs that (in theory) had to transfer less data.
Replication was daunting, but I stuck with it. In the end it was pretty painless and worked extremely well.
I created a replication role on the old server, xoxo-4:
CREATE ROLE xoxo5 WITH REPLICATION PASSWORD 'secret_password' LOGIN;
And I updated the access rules in /etc/postgresql/17/main/pg_hba.conf:
# Allow replication from xoxo5@10.0.0.5
host replication xoxo5 10.0.0.5/32 scram-sha-256
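Edits to pg_hba.conf only take effect after PostgreSQL reloads its configuration, and a typo in the file is easy to miss. Here is a minimal sketch of reloading and then confirming the rule parsed, using the pg_hba_file_rules view. The function name hba_check_sql is made up for illustration, and this assumes psql is run as the postgres superuser on xoxo-4.

```shell
#!/usr/bin/env bash
# Sketch only: hba_check_sql is a hypothetical helper, not part of the original setup.

# Print SQL that reloads the config and lists the pg_hba.conf rules for one role.
# A non-null "error" column means the line failed to parse.
hba_check_sql() {
  local role="${1:-xoxo5}"
  cat <<SQL
SELECT pg_reload_conf();
SELECT line_number, type, database, user_name, address, auth_method, error
FROM pg_hba_file_rules
WHERE '${role}' = ANY (user_name);
SQL
}

# Usage on xoxo-4 (not run automatically here):
#   hba_check_sql xoxo5 | sudo -u postgres psql
```

If the second query returns no row for the role, the rule was not picked up and pg_basebackup from xoxo-5 would be rejected.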
On the new host, xoxo-5, I emptied out the /var/lib/postgresql/17/main directory and enabled replication:
sudo -u postgres pg_basebackup -h 10.0.0.4 -p 5432 -U xoxo5 -D /var/lib/postgresql/17/main/ -Fp -Xs -R
This took a few hours but it was worth every second. Once the backup was done, the results were extremely promising: select * from pg_stat_replication and pgmetrics showed delays in the milliseconds, and spot checking counts in the statuses and accounts tables looked good (other than count(*) taking over 10 minutes on xoxo-4).
In theory the hardest, slowest part was done: the 88GB database was ready for cutover whenever we were.
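pg_stat_replication only has rows on the sending side, so it has to be queried on xoxo-4. The standby can report its own delay with pg_last_xact_replay_timestamp(). The sketch below is one way to turn that into a go/no-go check before cutover. The variable and function names and the one-second default threshold are assumptions, not something from the original migration.

```shell
#!/usr/bin/env bash
# Sketch only: STANDBY_LAG_SQL and lag_ok are hypothetical names.

# Run on the standby (xoxo-5): seconds since the last replayed transaction.
# This number also grows while the primary is idle, so read it with that in mind.
STANDBY_LAG_SQL="SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"

# Succeed if a lag value in seconds (may be fractional) is below the threshold.
# An empty value (nothing replayed yet) counts as a failure.
lag_ok() {
  local lag="$1" max="${2:-1}"
  [[ -n "$lag" ]] || return 1
  awk -v lag="$lag" -v max="$max" 'BEGIN { exit !(lag + 0 < max + 0) }'
}

# Usage on xoxo-5 (not run automatically here):
#   lag="$(sudo -u postgres psql -Atc "$STANDBY_LAG_SQL")"
#   lag_ok "$lag" 1 && echo "standby is caught up" || echo "standby is behind: ${lag}s"
```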
Not zero-downtime but I remain chuffed
I was emboldened by successful replication and by an hour of listening to Eurovision playlists at high volume. After some encouraging words like "why not" and "if you fuck this up maybe i can focus on work," I finalised a migration plan and kicked things off.
I had done a lot of work already at this point:
- I had worked through the server setup guide and finished an initial rsync of some key directories, including the nginx config
- A bunch of server config, including Mastodon service files, backup configuration, and crons, lives in an Ansible playbook, which I had already run
- The domain TTL was already ramped down to 60 seconds
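The TTL is worth double-checking before cutover, since a resolver that still holds the old, longer TTL will keep sending people to xoxo-4. A small sketch using dig, where answer_ttl is a hypothetical helper that pulls the TTL column out of the answer section:

```shell
#!/usr/bin/env bash
# Sketch only: answer_ttl is a hypothetical helper name.

# Read `dig +noall +answer` output on stdin and print the TTL of the first record.
# Answer lines have the columns: name, TTL, class, type, data.
answer_ttl() {
  awk 'NR == 1 { print $2 }'
}

# Usage (not run automatically here):
#   dig +noall +answer xoxo.zone A | answer_ttl
```

A recursive resolver reports the time remaining in its cache, so the number counts down between queries. Asking one of the domain's authoritative nameservers directly shows the configured value.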
Winding down xoxo-4 looked like this:
mastodon-bounce stop all # wrapper script that runs systemctl on all mastodon services
mastodon-bounce disable all
systemctl disable --now redis-server.service
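The real mastodon-bounce script isn't shown here, so what follows is only a guess at what such a wrapper could look like. It assumes the three unit names from Mastodon's stock install guide (mastodon-web, mastodon-sidekiq, mastodon-streaming). The SYSTEMCTL override is an addition for dry runs.

```shell
#!/usr/bin/env bash
# Hypothetical reconstruction of a mastodon-bounce style wrapper; the real one may differ.
# Unit names are assumed from the stock Mastodon systemd files.
MASTODON_UNITS=(mastodon-web mastodon-sidekiq mastodon-streaming)

# mastodon_bounce ACTION [all|web|sidekiq|streaming]
mastodon_bounce() {
  local action="$1" target="${2:-all}"
  local units=()
  if [[ "$target" == "all" ]]; then
    units=("${MASTODON_UNITS[@]}")
  else
    units=("mastodon-${target}")
  fi
  # Set SYSTEMCTL=echo to print the command instead of running it.
  ${SYSTEMCTL:-systemctl} "$action" "${units[@]}"
}

# Usage (not run automatically here):
#   mastodon_bounce stop all
#   SYSTEMCTL=echo mastodon_bounce restart web
```

Passing every unit to a single systemctl call keeps the stop and disable steps to one command each, which matters when the clock is running.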
Bringing things back up on xoxo-5 looked like this:
/root/sync.sh # /home/mastodon/live, /var/lib/redis, /etc/letsencrypt, /etc/nginx
pg_ctlcluster 17 main promote
systemctl enable --now redis-server.service
mastodon-bounce start all
mastodon-bounce enable all
RAILS_ENV=production ./bin/tootctl feeds build
RAILS_ENV=production ./bin/tootctl search deploy
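The contents of /root/sync.sh aren't shown, only the directories it covers. As a rough sketch, a final sync pulled from xoxo-4 could look like the following. The rsync flags, the root SSH login, and the RSYNC override are all assumptions. The source address is the one used for pg_basebackup.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a sync.sh; the real script's flags are not known.
SRC_HOST="${SRC_HOST:-10.0.0.4}"   # xoxo-4
SYNC_DIRS=(/home/mastodon/live /var/lib/redis /etc/letsencrypt /etc/nginx)

sync_all() {
  local dir
  for dir in "${SYNC_DIRS[@]}"; do
    # -a keeps owners, modes and timestamps; --delete drops files removed upstream.
    # Trailing slashes copy the directory contents into the same path locally.
    ${RSYNC:-rsync} -a --delete "root@${SRC_HOST}:${dir}/" "${dir}/"
  done
}

# Usage on xoxo-5 (not run automatically here):
#   RSYNC=echo sync_all   # dry run: print the rsync arguments
#   sync_all
```

Copying /var/lib/redis is only safe because redis-server was already stopped on xoxo-4 and is not started on xoxo-5 until after the sync.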
I also ran a few SQL commands to check data consistency:
sudo -u postgres psql -c '\x' -c 'select * from pg_stat_replication'
time sudo -u mastodon psql mastodon_production -c "select count(1) from statuses"
time sudo -u mastodon psql mastodon_production -c "select id, created_at from statuses order by created_at desc limit 10"
time sudo -u mastodon psql mastodon_production -c "select count(1) from accounts"
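One more check that fits here: after the promote, pg_is_in_recovery() should flip from true to false on xoxo-5, and pg_stat_replication there will be empty because nothing is replicating from it. A tiny sketch that makes the psql output readable, with recovery_state being a made-up helper name:

```shell
#!/usr/bin/env bash
# Sketch only: recovery_state is a hypothetical helper name.

# Translate the t/f that `psql -At` prints for a boolean into a readable state.
recovery_state() {
  case "$1" in
    t) echo "standby (still in recovery)" ;;
    f) echo "primary (promoted)" ;;
    *) echo "unknown"; return 1 ;;
  esac
}

# Usage on xoxo-5 (not run automatically here):
#   recovery_state "$(sudo -u postgres psql -Atc 'select pg_is_in_recovery()')"
```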
Job's done
The migration took about 15 minutes from start to finish. Next time, with SSD hosts on both sides, I could probably get it down to seconds or maybe even zero downtime.
I did hit one snag that was masked by caching in the Mastodon service worker: the mastodon account's home directory was created with permissions 0750, and Nginx could not read files in the web directory, causing a lot of busted client pages for about 30 minutes after I "finished" the migration.
There's always something.
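For next time, this kind of thing can be caught before cutover by walking up from a static file and checking that every parent directory lets "other" users traverse it. The sketch below assumes GNU stat and that the Nginx user is neither the owner nor in the group of these directories. The function name is made up.

```shell
#!/usr/bin/env bash
# Sketch only: first_blocked_dir is a hypothetical helper; requires GNU stat.

# Print the topmost ancestor directory of PATH that lacks o+x and return 1.
# Return 0 if every directory from PATH's parent up to STOP is traversable.
first_blocked_dir() {
  local path="$1" stop="${2:-/}" dir mode blocked=""
  dir="$(dirname "$path")"
  while :; do
    mode="$(stat -c '%A' "$dir")"              # e.g. drwxr-x---
    [[ "${mode:9:1}" == [xt] ]] || blocked="$dir"
    [[ "$dir" == "$stop" || "$dir" == "/" ]] && break
    dir="$(dirname "$dir")"
  done
  [[ -z "$blocked" ]] && return 0
  echo "$blocked"
  return 1
}

# Usage (not run automatically here):
#   first_blocked_dir /home/mastodon/live/public/robots.txt || echo "fix the directory above"
#   chmod o+x /home/mastodon   # the fix Mastodon's install guide uses
```

With a 0750 home directory this prints /home/mastodon, which is exactly the directory Nginx was getting stuck on.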
But still! It's done and I'm happy with how it went.
Let's not do it again for a long time.