The Setup
I have a 3-node Proxmox 9 cluster, each with a 4TB SSD and 2TB HDD dedicated to Ceph. The NVMe drives stay local for fast VM storage. The question was: how do I use the SSDs for performance-sensitive workloads and the HDDs for bulk storage?
CRUSH Rules: Telling Ceph Where to Put Data
Ceph already knows which OSDs are SSDs and which are HDDs — it assigns device classes automatically. But by default, it’ll spread data across all OSDs regardless of type. CRUSH rules let you pin pools to specific device classes.
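The two rules look roughly like this. The rule names are my own choice; the important parts are the failure domain (`host`) and the device class at the end:

```shell
# Sketch, assuming the default CRUSH root.
# Syntax: ceph osd crush rule create-replicated <name> <root> <failure-domain> <device-class>
ceph osd crush rule create-replicated replicated-ssd default host ssd
ceph osd crush rule create-replicated replicated-hdd default host hdd

# Confirm both rules exist
ceph osd crush rule ls
```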
That’s it. Two rules: one that only uses SSDs, one that only uses HDDs. The `host` failure domain ensures replicas land on different nodes.
The Pool Layout
I created six pools across two storage types:
CephFS Pools (shared filesystem)
CephFS needs a metadata pool and one or more data pools. Metadata goes on SSD because it’s small and latency-sensitive.
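A sketch of the pool creation, assuming the CRUSH rules are named `replicated-ssd` and `replicated-hdd`. Pool names and PG counts here are illustrative:

```shell
# Metadata and the primary data pool on SSD, bulk data pool on HDD.
# PG counts are starting points; the autoscaler can adjust them later.
ceph osd pool create cephfs_metadata 32
ceph osd pool set cephfs_metadata crush_rule replicated-ssd

ceph osd pool create cephfs_data_ssd 32
ceph osd pool set cephfs_data_ssd crush_rule replicated-ssd

ceph osd pool create cephfs_data_hdd 128
ceph osd pool set cephfs_data_hdd crush_rule replicated-hdd
```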
Then create the filesystem and add the HDD pool as a secondary data pool:
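Assuming the pool names above and a filesystem simply called `cephfs`:

```shell
ceph fs new cephfs cephfs_metadata cephfs_data_ssd
ceph fs add_data_pool cephfs cephfs_data_hdd

# Directories can then be steered to the HDD pool with a file layout
# attribute, e.g. (path is hypothetical):
# setfattr -n ceph.dir.layout.pool -v cephfs_data_hdd /mnt/pve/cephfs/media
```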
RBD Pools (block devices for VMs/LXCs)
CephFS is great for shared files, but VM and LXC disks need block storage. RBD (RADOS Block Device) is purpose-built for this.
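One replicated pool per tier, named to match the Proxmox storage IDs used later. PG counts are again illustrative:

```shell
ceph osd pool create rbd-ssd 32
ceph osd pool set rbd-ssd crush_rule replicated-ssd
ceph osd pool application enable rbd-ssd rbd
rbd pool init rbd-ssd

ceph osd pool create rbd-hdd 128
ceph osd pool set rbd-hdd crush_rule replicated-hdd
ceph osd pool application enable rbd-hdd rbd
rbd pool init rbd-hdd
```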
Why Both CephFS and RBD?
They serve different purposes on the same OSDs:
| Storage | Type | Best For |
|---|---|---|
| CephFS | Shared filesystem | ISOs, templates, backups, media libraries |
| RBD | Block device | VM disks, LXC rootfs — anything that needs raw I/O |
Think of it like having both an NFS share and a SAN on the same hardware.
MDS: The Missing Piece
CephFS requires at least one Metadata Server (MDS). On Proxmox, create one per node — the active MDS handles metadata operations, standbys take over if it fails.
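With the Proxmox tooling this is one command per node (it can also be done from the GUI under Ceph → CephFS):

```shell
# Run once on each of the three nodes
pveceph mds create
```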
Verify with `ceph mds stat`; you should see one active and two standby.
Adding Storage to Proxmox
This is where I hit a snag. My Ceph monitors bind to the storage VLAN (10.150.65.x), not the management VLAN (10.150.60.x). Using the wrong IPs for `--monhost` caused mount timeouts and cryptic errors.
The fix: always use the IPs where your Ceph monitors actually listen.
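A sketch of the `pvesm add` calls, one RBD and one CephFS example. The monitor IPs below are placeholders for my storage-VLAN addresses; check yours with `ceph mon dump`:

```shell
# Block storage for VM disks and LXC root filesystems
pvesm add rbd rbd-ssd --pool rbd-ssd \
  --monhost "10.150.65.11 10.150.65.12 10.150.65.13" \
  --content images,rootdir

# Shared filesystem for ISOs, templates, and snippets
pvesm add cephfs cephfs-ssd \
  --monhost "10.150.65.11 10.150.65.12 10.150.65.13" \
  --content iso,vztmpl,snippets
```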
The Final Layout
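The resulting `/etc/pve/storage.cfg` looks roughly like this. It's a sketch: the monitor IPs and content assignments are assumptions from my setup, and `nvme-local` (plain node-local storage, not part of Ceph) is left out:

```
rbd: rbd-ssd
        pool rbd-ssd
        monhost 10.150.65.11 10.150.65.12 10.150.65.13
        content images,rootdir

rbd: rbd-hdd
        pool rbd-hdd
        monhost 10.150.65.11 10.150.65.12 10.150.65.13
        content images,rootdir

cephfs: cephfs-ssd
        path /mnt/pve/cephfs-ssd
        monhost 10.150.65.11 10.150.65.12 10.150.65.13
        content iso,vztmpl,snippets

cephfs: cephfs-hdd
        path /mnt/pve/cephfs-hdd
        monhost 10.150.65.11 10.150.65.12 10.150.65.13
        content backup
```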
Five storage tiers, each with a clear purpose:
- nvme-local: Fastest. Gaming VMs with GPU passthrough, anything that needs raw local NVMe speed.
- rbd-ssd: Fast, shared. LXC/VM root disks that need to migrate between nodes. Jellyfin’s OS disk goes here.
- rbd-hdd: Shared, bulk. VMs that don’t need speed.
- cephfs-ssd: Shared filesystem. ISOs, LXC templates, cloud-init snippets. Upload once, available on all nodes.
- cephfs-hdd: Shared filesystem, bulk. Backups and media libraries. Jellyfin’s `/mnt/library` goes here.
Practical Example: Jellyfin
When I migrate Jellyfin to the new cluster, the storage split will be:
- Root disk → `rbd-ssd` (OS, app, database; needs random I/O)
- Media library → `cephfs-hdd` (movies, music, TV; sequential reads, doesn’t need SSD)
The media is replicated across all three nodes automatically. If a node goes down, Jellyfin keeps streaming from the surviving copies.
Gotchas
- Monitor IPs matter: `ceph mon dump` shows where monitors actually listen. If you have a separate storage VLAN, use those IPs, not your management VLAN.
- `pvesm remove` is cluster-wide: Learned this the hard way during VM migrations. Removing storage on one node removes the config from all nodes.
- Bulk flag: Set `ceph osd pool set <pool> bulk true` on HDD pools so the PG autoscaler makes better decisions.
- pmxcfs locks: If `pvesm add` times out, the Proxmox cluster config can get locked. `systemctl restart pve-cluster` usually clears it.
- Keyring files: CephFS needs both a keyring and a secret file in `/etc/pve/priv/ceph/`. RBD pools create these automatically; CephFS sometimes doesn’t.
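If the secret file is missing, it can be created by hand. This assumes the CephFS storage is named `cephfs-hdd` and authenticates as `client.admin`; adjust both to your setup:

```shell
# Proxmox expects the secret file to be named after the storage ID
ceph auth get-key client.admin > /etc/pve/priv/ceph/cephfs-hdd.secret
```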