Why Now
pve005 was an i5-7500 with 16GB of RAM. It ran the original Jellyfin LXC with 1.1TB of media on a local ZFS pool. Once I rebuilt the lab around a 3-node Ryzen 9 cluster with Ceph storage, pve005 became dead weight. The media had been migrated to jellyfin01 on the new cluster months ago. The old LXC was stopped. pve005 was still drawing power, still holding a Ceph OSD, and still showing up in every Ansible run.
Time to pull the plug.
The Starting State
pve005 was part of a 6-node legacy Proxmox 8 cluster (pve001-006), with Ceph spread across five of the nodes. It was running:
- Ceph OSD.4 (930GB SSD, part of the legacy Ceph cluster)
- Ceph monitor, manager, and metadata server
- NUT client (network UPS monitoring)
- Netdata agent
- Tailscale
No VMs, no cron jobs. Just the stopped jellyfin LXC (3001) with a NAT redirect forwarding old traffic to the new server.
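The redirect itself was nothing fancy; something in the spirit of this DNAT rule (8096 is Jellyfin's default port, and the destination address below is a placeholder rather than the real jellyfin01 IP):

```bash
# Sketch of the kind of NAT redirect that kept old Jellyfin URLs working;
# 192.0.2.10 stands in for the new server's address
iptables -t nat -A PREROUTING -p tcp --dport 8096 \
  -j DNAT --to-destination 192.0.2.10:8096
```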
Step 1: Drain the OSD
I had already marked OSD.4 as out a week earlier to let the data rebalance at its own pace. Checking the OSD status confirmed it was fully drained: zero bytes, zero placement groups.
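I didn't keep the exact output, but the check was roughly this (stock Ceph CLI, OSD id 4 as above):

```bash
# A week earlier: mark the OSD out so Ceph rebalances data off it
ceph osd out 4

# Now: confirm osd.4 holds no data and no placement groups
ceph osd df tree    # expect 0 B used and 0 PGs on osd.4
ceph -s             # expect HEALTH_OK before touching anything
```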
OSD.4 showed 0B across the board. The cluster was HEALTH_OK with data evenly distributed across the remaining four OSDs. Safe to remove.
Stopped the daemon, purged the OSD, and cleaned up the CRUSH map:
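The exact invocations were roughly:

```bash
# On pve005: stop the OSD daemon
systemctl stop ceph-osd@4

# From any node with an admin keyring: purge the OSD
# (drops it from the CRUSH map, deletes its auth key, removes it from the OSD map)
ceph osd purge 4 --yes-i-really-mean-it

# Optional tidy-up: remove the now-empty pve005 host bucket from the CRUSH map
ceph osd crush remove pve005

# Verify
ceph osd tree
```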
The --yes-i-really-mean-it flag is one of those Ceph things. It removes the OSD from the CRUSH map, deletes its auth keys, and wipes its entry from the OSD map in one shot.
Step 2: Remove the Monitor, Manager, and MDS
With the OSD gone, pve005 was, at least on paper, still a Ceph monitor (quorum member), manager (standby), and metadata server (standby). Dropping the monitor takes the cluster from 5 monitors to 4, which still has quorum (3 of 4 is a majority).
The monitor was already gone when I checked. Must have been removed during an earlier maintenance window I forgot about. The manager and MDS needed manual cleanup:
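Something along these lines, using the pveceph wrappers, which also handle the systemd units and auth keys (the raw ceph equivalents work too):

```bash
# On pve005: destroy the standby manager and metadata server
pveceph mgr destroy pve005
pveceph mds destroy pve005

# From any node: confirm the cluster is still healthy
ceph -s
```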
After each removal, ceph status stayed HEALTH_OK. Four monitors, three manager standbys, three MDS standbys. The cluster barely noticed.
Step 3: Leave the Corosync Cluster
This is the Proxmox-level removal. First, stop cluster services on pve005 so it leaves the ring:
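In practice that means two systemd units:

```bash
# On pve005: stop the cluster filesystem and corosync so the node drops out of the ring
systemctl stop pve-cluster corosync
```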
Then from another node (pve001):
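A single pvecm call does it:

```bash
# On pve001: remove pve005 from the cluster configuration
pvecm delnode pve005

# Check membership and the new config version
pvecm status
```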
The command printed Could not kill node (error = CS_ERR_NOT_EXIST) because pve005 had already dropped out of the corosync membership when I stopped its services. The removal still went through: the cluster config version bumped from 16 to 17, and pvecm status showed 5 nodes.
Cleaned up the leftover node config:
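It lives in the shared pmxcfs under the node's name:

```bash
# On any remaining node: remove pve005's leftover directory from /etc/pve
rm -rf /etc/pve/nodes/pve005
```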
That directory had the old jellyfin LXC config (3001.conf), SSL certs, and some cluster state. All irrelevant now.
Step 4: Clean Up Everything Else
The remaining cleanup was quick:
- Ansible inventory: commented out pve005 from the nut_netclients group
- Tailscale admin panel: removed the machine
- Netdata Cloud: removed the node
- Power: attempted a shutdown over SSH, then finished it physically at the rack (more on that below)
What Surprised Me
Ceph is remarkably graceful about losing a node. I expected at least a brief HEALTH_WARN during the monitor removal, but the cluster stayed healthy through every step. The OSD drain had already moved all data, so the final purge was instant.
The order matters. Drain the OSD first (this takes time), then remove the Ceph daemons (monitors, managers, MDS), then leave the Proxmox cluster, then clean up the config directory. If you remove from corosync first, you lose access to the shared /etc/pve/ filesystem and make the Ceph cleanup harder.
Tailscale SSH was both a blessing and a curse. It made remote access trivial during the decommission, but once I stopped corosync and Tailscale on pve005, the hostname stopped resolving. Had to fall back to the raw IP for the final shutdown, and even that failed because SSH key auth was only configured through Tailscale. Ended up needing physical access.
Current State
The legacy cluster is down to 5 nodes (pve001-004, pve006) with 4 Ceph OSDs. Everything that was running on pve005 has been migrated or shut down. The hardware is sitting on the shelf, waiting to be sold or repurposed.
One less node drawing power. One less thing in the Ansible inventory. Progress.