Why Now

pve005 was an i5-7500 with 16GB of RAM. It ran the original Jellyfin LXC with 1.1TB of media on a local ZFS pool. Once I rebuilt the lab around a 3-node Ryzen 9 cluster with Ceph storage, pve005 became dead weight. The media was migrated to jellyfin01 on the new cluster months ago. The old LXC was stopped. pve005 was still drawing power, still in the Ceph quorum, and still showing up in every Ansible run.

Time to pull the plug.

The Starting State

pve005 was part of a 6-node legacy Proxmox 8 cluster (pve001-006) with a 5-node Ceph pool. It was running:

  • Ceph OSD.4 (930GB SSD, part of the legacy Ceph cluster)
  • Ceph monitor, manager, and metadata server
  • NUT client (network UPS monitoring)
  • Netdata agent
  • Tailscale

No VMs, no cron jobs. Just the stopped Jellyfin LXC (3001) with a NAT redirect forwarding old traffic to the new server.

Step 1: Drain the OSD

I had already marked OSD.4 as out a week earlier to let the data rebalance at its own pace. Checking the OSD status confirmed it was fully drained: zero bytes, zero placement groups.

ceph osd df

OSD.4 showed 0B across the board. The cluster was HEALTH_OK with data evenly distributed across the remaining four OSDs. Safe to remove.
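Eyeballing ceph osd df works, but the drain check is easy to script. A minimal sketch, assuming the JSON shape that ceph osd df --format json emits (a nodes array with per-OSD kb_used and pgs fields); the sample output below is fabricated for illustration:

```python
import json

def osd_is_drained(osd_df_json: str, osd_id: int) -> bool:
    """Return True if the given OSD reports zero used space and zero PGs."""
    for node in json.loads(osd_df_json)["nodes"]:
        if node["id"] == osd_id:
            return node["kb_used"] == 0 and node["pgs"] == 0
    raise ValueError(f"osd.{osd_id} not found in 'ceph osd df' output")

# Fabricated sample resembling `ceph osd df --format json` output.
sample = json.dumps({
    "nodes": [
        {"id": 0, "kb_used": 210_000_000, "pgs": 33},
        {"id": 4, "kb_used": 0, "pgs": 0},   # the drained OSD
    ]
})

print(osd_is_drained(sample, 4))  # True -> safe to purge
```

In practice you would feed it the real output, e.g. osd_is_drained(subprocess.check_output(["ceph", "osd", "df", "--format", "json"]), 4), and refuse to purge on False.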

Stopped the daemon, purged the OSD, and cleaned up the CRUSH map:

systemctl stop ceph-osd@4
ceph osd purge 4 --yes-i-really-mean-it
ceph osd crush remove pve005

The --yes-i-really-mean-it flag is one of those Ceph safety interlocks. The purge command removes the OSD from the CRUSH map, deletes its auth keys, and wipes its entry from the OSD map in one shot.

Step 2: Remove the Monitor, Manager, and MDS

With the OSD gone, pve005 was still a Ceph monitor (quorum member), manager (standby), and metadata server (standby). Removing these drops the cluster from 5 monitors to 4, which still has quorum (needs 3 of 4).
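The quorum arithmetic is simple majority: n monitors need floor(n/2) + 1 votes to agree. A quick sketch of the math behind "still has quorum":

```python
def quorum(n_monitors: int) -> int:
    """Minimum number of monitors that must agree (simple majority)."""
    return n_monitors // 2 + 1

# 5 monitors need 3 votes and tolerate 2 failures; dropping to 4
# still needs 3 of 4, so losing pve005's monitor keeps quorum intact.
print(quorum(5))  # 3
print(quorum(4))  # 3
```

Note that going from 5 to 4 monitors buys nothing in fault tolerance: both sizes survive only 2 monitor failures, which is why even-sized monitor sets are generally discouraged.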

The monitor was already gone when I checked. Must have been removed during an earlier maintenance window I forgot about. The manager and MDS needed manual cleanup:

systemctl stop ceph-mgr@pve005
systemctl disable ceph-mgr@pve005
ceph auth del mgr.pve005

systemctl stop ceph-mds@pve005
systemctl disable ceph-mds@pve005
ceph auth del mds.pve005
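The mgr and MDS cleanup is the same three commands with a different daemon type, so it templatizes cleanly. A sketch that generates the command list (the helper name is mine, not from the post):

```python
def ceph_daemon_cleanup(daemon_type: str, host: str) -> list[str]:
    """Commands to stop, disable, and de-auth a standby Ceph daemon on a host."""
    unit = f"ceph-{daemon_type}@{host}"
    return [
        f"systemctl stop {unit}",
        f"systemctl disable {unit}",
        f"ceph auth del {daemon_type}.{host}",
    ]

# Reproduces the mgr and MDS cleanup shown above.
for cmd in ceph_daemon_cleanup("mgr", "pve005") + ceph_daemon_cleanup("mds", "pve005"):
    print(cmd)
```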

After each removal, ceph status stayed HEALTH_OK. Four monitors, three manager standbys, three MDS standbys. The cluster barely noticed.

Step 3: Leave the Corosync Cluster

This is the Proxmox-level removal. First, stop cluster services on pve005 so it leaves the ring:

systemctl stop pve-cluster corosync

Then from another node (pve001):

pvecm delnode pve005

The command printed Could not kill node (error = CS_ERR_NOT_EXIST) because the node was already gone from the perspective of corosync. The cluster config version bumped from 16 to 17, and pvecm status showed 5 nodes.

Cleaned up the leftover node config:

rm -rf /etc/pve/nodes/pve005/

That directory had the old jellyfin LXC config (3001.conf), SSL certs, and some cluster state. All irrelevant now.
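Since pvecm delnode leaves the node's directory under /etc/pve/nodes/ behind, stale directories can accumulate over multiple decommissions. A sketch that diffs the directory listing against the current member list (the data below is illustrative, mirroring this cluster):

```python
def stale_node_dirs(node_dirs: list[str], cluster_members: list[str]) -> list[str]:
    """Directories under /etc/pve/nodes/ with no matching cluster member."""
    members = set(cluster_members)
    return sorted(d for d in node_dirs if d not in members)

# After delnode, pve005's directory lingers until removed by hand.
dirs = ["pve001", "pve002", "pve003", "pve004", "pve005", "pve006"]
members = ["pve001", "pve002", "pve003", "pve004", "pve006"]
print(stale_node_dirs(dirs, members))  # ['pve005']
```

The real inputs would come from os.listdir("/etc/pve/nodes") and the node list in pvecm status output.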

Step 4: Clean Up Everything Else

The remaining cleanup was quick:

  • Ansible inventory: commented out pve005 from the nut_netclients group
  • Tailscale admin panel: removed the machine
  • Netdata Cloud: removed the node
  • Power: shut down via SSH, then physically at the rack

What Surprised Me

Ceph is remarkably graceful about losing a node. I expected at least a brief HEALTH_WARN during the monitor removal, but the cluster stayed healthy through every step. The OSD drain had already moved all data, so the final purge was instant.

The order matters. Drain the OSD first (this takes time), then remove the Ceph daemons (monitors, managers, MDS), then leave the Proxmox cluster, then clean up the config directory. If you remove from corosync first, you lose access to the shared /etc/pve/ filesystem and make the Ceph cleanup harder.
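That ordering can be pinned down as a runbook so nothing runs out of sequence. A sketch; the steps mirror the post, the structure and comments are mine:

```python
# Decommission runbook: each step must complete before the next starts.
DECOMMISSION_STEPS = [
    ("drain OSD",          "ceph osd out 4  # then wait for rebalance to finish"),
    ("purge OSD",          "ceph osd purge 4 --yes-i-really-mean-it"),
    ("remove mgr/mds",     "ceph auth del mgr.pve005 && ceph auth del mds.pve005"),
    ("leave corosync",     "pvecm delnode pve005"),
    ("remove node config", "rm -rf /etc/pve/nodes/pve005/"),
]

def print_runbook(steps: list[tuple[str, str]]) -> None:
    for i, (name, cmd) in enumerate(steps, 1):
        print(f"{i}. {name}: {cmd}")

print_runbook(DECOMMISSION_STEPS)
```

The key invariant is that "leave corosync" comes after all Ceph cleanup: once the node leaves the ring, the shared /etc/pve/ filesystem is gone and the remaining steps get harder.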

Tailscale SSH was both a blessing and a curse. It made remote access trivial during the decommission, but once I stopped corosync and Tailscale on pve005, the hostname stopped resolving. Had to fall back to the raw IP for the final shutdown, and even that failed because SSH key auth was only configured through Tailscale. Ended up needing physical access.

Current State

The legacy cluster is down to 5 nodes (pve001-004, pve006) with 4 Ceph OSDs. Everything that was running on pve005 has been migrated or shut down. The hardware is sitting on the shelf, waiting to be sold or repurposed.

One less node drawing power. One less thing in the Ansible inventory. Progress.