## Overview

Step-by-step guide for adding a new Proxmox node to the cluster, including hardware preparation, cluster join, Ceph configuration, Tailscale, and Ansible automation.

## Prerequisites
- Physical or virtual hardware ready
- Network access to existing cluster nodes
- Proxmox VE ISO (latest stable version)
- Cluster network details (IPs, gateway, DNS)
- Access to vault password for Ansible
- Access to Tailscale admin console
## Planning Phase

### 1. Determine Node Specifications

```text
Hostname: pveXXX (e.g., pve009)
Management IP: 10.x.x.XX
Purpose: General compute / GPU workload / Storage / etc.
RAM: XXX GB
CPU: Model and core count
Storage: Disk configuration plan
GPU: If applicable
```
### 2. Network Planning

```shell
# Determine next available IP in cluster range
# Check existing nodes
ansible all -i inventory/homelab.yml -m shell -a "hostname -I | awk '{print \$1}'"

# Or check manually
for i in {40..60}; do ping -c 1 -W 1 10.x.x.$i &>/dev/null && echo "10.x.x.$i - UP" || echo "10.x.x.$i - available"; done
```
### 3. Check Cluster Health Before Adding

```shell
# On existing cluster node
pvecm status
ceph -s  # If using Ceph

# Ensure cluster is healthy before proceeding
```
## Phase 1: Hardware Preparation

### A. Physical Hardware Setup

```shell
# Download latest Proxmox VE ISO
# https://www.proxmox.com/en/downloads

# Verify checksum
sha256sum proxmox-ve_*.iso

# Create bootable USB (macOS)
sudo dd if=proxmox-ve_*.iso of=/dev/diskX bs=1m status=progress

# Or use balenaEtcher, Rufus, etc.
```
## Phase 2: Proxmox Installation

### A. Boot the Installer

- Boot from USB/DVD
- Select “Install Proxmox VE”
- Accept EULA

### B. Target Disk Selection
- Select installation disk (usually smallest SSD)
- Important: Note which disks you’re saving for Ceph/storage
- Consider ZFS for root disk if desired
### C. Location and Time Zone

```text
Country: United States (or appropriate)
Time zone: America/Los_Angeles (or appropriate)
Keyboard Layout: en-us
```
### D. Password and Email

```text
Root password: <use strong password, store in vault>
Email: your-email@example.com (or appropriate)
```
### E. Network Configuration

```text
Management Interface: Choose primary NIC (usually eno1, ens18, etc.)
Hostname (FQDN): pveXXX.localdomain (e.g., pve009.localdomain)
IP Address: 10.150.60.XX/24
Gateway: 10.150.60.1
DNS Server: 10.150.60.1 (or appropriate)
```

Important: Double-check IP is not in use!
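A quick way to check, sketched as a shell helper (the candidate IP below is a placeholder; a ping reply proves the address is taken, but silence is not a guarantee since hosts can drop ICMP, so cross-check DHCP leases too):

```shell
# Hedged sketch: probe a candidate management IP before assigning it.
check_ip() {
  if ping -c 1 -W 1 "$1" >/dev/null 2>&1; then
    echo "$1 - IN USE"
  else
    echo "$1 - no ping reply (probably available)"
  fi
}

check_ip 10.150.60.49   # placeholder candidate for the new node
```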
### F. BIOS Settings (Before Install)

- IOMMU enabled (Intel VT-d / AMD-Vi)
- SVM / virtualization enabled
- Secure Boot disabled
- Above 4G Decoding enabled (for GPU passthrough)

### G. Complete Installation

- Review summary
- Click “Install”
- Wait for installation to complete (~10 minutes)
- Remove installation media
- Reboot
## Phase 3: Post-Installation Setup

### A. Initial Login and Validation

```shell
# SSH to new node
ssh root@10.x.x.XX

# Verify basic connectivity
ping -c 3 8.8.8.8
ping -c 3 google.com

# Check time sync (critical for cluster)
timedatectl status
# Should show: System clock synchronized: yes

# If not synced
systemctl enable --now systemd-timesyncd
timedatectl set-ntp true
```
### B. Fix Repositories and Subscription Nag (before updates)

Run from the Proxmox host shell (not SSH) — the script uses interactive prompts that work best from the web UI console.

```shell
# Run community post-install script — fixes enterprise repos, adds no-subscription repos,
# sets up Ceph repos, and removes subscription nag popup
bash -c "$(curl -fsSL https://raw.githubusercontent.com/community-scripts/ProxmoxVE/main/tools/pve/post-pve-install.sh)"
```
Note: Ansible (`roles/proxmox/tasks/repositories.yml`) also handles this, but repos must be correct before `apt update` and `pveceph install` will work.
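A quick sanity check before continuing (file locations vary by PVE version, e.g. `.list` vs deb822 `.sources` files, so treat this as a sketch):

```shell
# Look for an active (uncommented) enterprise repo entry, and confirm the
# no-subscription repo is referenced somewhere under /etc/apt/.
if grep -rhsE '^[^#]*enterprise\.proxmox\.com' /etc/apt/sources.list /etc/apt/sources.list.d/; then
  echo "WARNING: enterprise repo still active"
else
  echo "enterprise repo not active"
fi
grep -rqs 'pve-no-subscription' /etc/apt/ \
  && echo "no-subscription repo present" \
  || echo "no-subscription repo missing"
```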
### C. Update System

```shell
apt update && apt full-upgrade -y

# May require reboot
reboot
```
### D. Verify Web Interface Access

```text
https://10.x.x.XX:8006
Login: root
Password: <root password>
```
## Phase 3.5: VLAN Network Configuration

### Current VLAN Layout

| VLAN | Purpose | Subnet | Host IPs |
|---|---|---|---|
| 60 | Management + cluster (untagged/native) | 10.150.60.0/24 | .21, .22, … |
| 65 | Storage / Ceph | 10.150.65.0/24 | .21, .22, … (no gateway) |
| 70 | Guest / VM traffic | 10.150.70.0/24 | no host IP needed |
### Network Interfaces Config

/etc/network/interfaces:

```text
auto lo
iface lo inet loopback

iface nic0 inet manual

auto vmbr0
iface vmbr0 inet static
    address 10.150.60.XX/24
    gateway 10.150.60.1
    bridge-ports nic0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 65 70

auto vmbr0.65
iface vmbr0.65 inet static
    address 10.150.65.XX/24

source /etc/network/interfaces.d/*
```
### Verify VLAN Configuration

```shell
# Apply config
ifreload -a

# Verify VLAN interface
ip addr show vmbr0.65

# Verify bridge is passing VLANs on physical port
bridge vlan show dev nic0
# Must show VLANs 65 and 70, not just VLAN 1

# Test cross-node connectivity on storage VLAN
ping -c 3 10.150.65.XX  # other node's storage IP
```
### Switch Requirements (USW Flex / UniFi)

- Port profile: trunk with native VLAN 60
- Tagged VLANs: 65, 70 (or “allow all”)
- VLANs 65 and 70 must exist as networks in UniFi controller

## Phase 4: Join Proxmox Cluster

Critical: Joining a cluster is irreversible without reinstalling. Make sure you’re ready!
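Before pulling the trigger, it is cheap to snapshot the node's standalone config in case you rebuild (the backup path below is an assumption, not part of this runbook):

```shell
# Archive the files you'd want when reinstalling this node from scratch.
backup="/root/prejoin-$(hostname)-$(date +%Y%m%d%H%M%S).tar.gz"
tar -czf "$backup" \
  /etc/network/interfaces \
  /etc/hosts /etc/hostname /etc/resolv.conf 2>/dev/null || true
tar -tzf "$backup"
echo "saved: $backup"
```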
### A. Pre-Join Checks

On new node:

```shell
# Ensure no VMs or containers exist
qm list   # Should be empty
pct list  # Should be empty

# Verify time sync with cluster
ssh root@pve001 "date +%s" && date +%s
# Times should be within 1-2 seconds

# Verify cluster communication
ping -c 3 pve001
ssh root@pve001 "pvecm status"
```
### B. Gather Join Information

On existing cluster node:

```shell
pvecm status
# Note the cluster name

# Get join information
pvecm add --help
```
### C. Join Cluster

On new node:

```shell
# Join cluster (use IP of existing node)
pvecm add 10.x.x.44

# You'll be prompted for:
# - Root password of existing node
# - Confirmation

# This will:
# - Configure corosync
# - Restart networking
# - Join cluster
# - Regenerate SSH host keys
```
Note: SSH connection will drop during join. Wait ~30 seconds, then reconnect.
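A small poll loop saves guessing at when the node is reachable again (a sketch using bash's `/dev/tcp`; the IP is a placeholder):

```shell
# Wait until a TCP port answers, with no extra tools needed.
wait_for_port() {            # usage: wait_for_port HOST PORT [ATTEMPTS]
  local host=$1 port=$2 tries=${3:-30} i
  for i in $(seq 1 "$tries"); do
    if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
      echo "up after $i attempt(s)"; return 0
    fi
    sleep 1
  done
  echo "timeout"; return 1
}

# After the join drops your session:
# wait_for_port 10.150.60.49 22 && ssh root@10.150.60.49
```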
### D. Verify Cluster Membership

On any cluster node:

```shell
pvecm status
# Should show new node in member list

pvecm nodes
# Should list all nodes including new one

# Check quorum
corosync-quorumtool -s
# Should show expected votes and quorum info
```
On new node:

```shell
# Verify you can see cluster storage
pvesm status

# Verify you can see other nodes
pvesh get /nodes
```
## Phase 5: Storage Configuration

### A. Verify Local Storage

```shell
# Check existing storage
pvesm status

# Local storage should already exist from installation
```
### B. Ceph OSD Setup (If Using Ceph)

Prerequisites:

- At least 3 nodes in cluster for redundancy
- Dedicated disks for Ceph (separate from OS disk)
- Network connectivity to other Ceph nodes
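A quick pre-flight sketch; the node-count parser assumes the `pvecm nodes` membership table format (two leading numeric columns), which may differ across PVE versions:

```shell
# Count membership rows in `pvecm nodes` output (Nodeid / Votes / Name table).
count_nodes() { awk '/^[[:space:]]*[0-9]+[[:space:]]+[0-9]+/ {n++} END {print n+0}'; }

# On a cluster node:
# n=$(pvecm nodes | count_nodes)
# [ "$n" -ge 3 ] && echo "node count OK ($n)" || echo "need >=3 nodes, have $n"

# List whole disks that could become OSDs (must be blank, not the OS disk):
# lsblk -dno NAME,SIZE,TYPE | awk '$3=="disk"'
```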
On new node:

```shell
# Install Ceph packages (if not already)
pveceph install

# Create Ceph monitor (if cluster needs more)
pveceph mon create

# Create Ceph manager
pveceph mgr create

# List available disks
lsblk
pveceph osd list

# Create OSDs for each data disk (replace /dev/sdX)
pveceph osd create /dev/sdb
pveceph osd create /dev/sdc
pveceph osd create /dev/sdd
```
### C. Verify Ceph Health

```shell
ceph -s
# Should show: HEALTH_OK (after rebalancing completes)

ceph osd tree
# Should show new OSDs in topology

ceph osd df
# Should show OSDs with appropriate size
```
Note: Ceph will rebalance data. This can take hours or days depending on cluster size.
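To log progress instead of watching, a tiny parser can pull the degraded percentage out of `ceph -s` (the exact text varies by Ceph release, so treat this as a sketch):

```shell
# Extract the first percentage (e.g. from "objects degraded (33.718%)").
degraded_pct() { grep -oE '[0-9]+\.[0-9]+%' | head -n 1; }

# Example loop on a cluster node:
# while ceph -s | grep -q degraded; do
#   echo "$(date +%T) degraded: $(ceph -s | degraded_pct)"
#   sleep 60
# done
```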
## Phase 6: Tailscale Setup

### A. Manual Method (For Testing)

```shell
# On new node
curl -fsSL https://tailscale.com/install.sh | sh

# Authenticate (use auth key from Tailscale admin)
tailscale up --authkey=tskey-auth-XXXXX --hostname=pveXXX --advertise-tags=tag:proxmox --ssh

# Verify
tailscale status
```
### B. Remove Old Tailscale Entry (If Hostname Existed)

- Go to https://login.tailscale.com/admin/machines
- Search for `pveXXX`
- Delete any old entries with the same hostname
- Re-run the `tailscale up` command above
### C. Verify Tailscale Connectivity

```shell
# From your local machine
ping pveXXX.your-tailnet.ts.net

# SSH via Tailscale
ssh root@pveXXX.your-tailnet.ts.net

# Test web UI
# Open: https://pveXXX.your-tailnet.ts.net:8006
```
## Phase 7: Ansible Configuration

### A. Add to Inventory

```shell
# Edit inventory file
vi ansible/inventory/homelab.yml

# Add new node under proxmox > nut_netclients > hosts
# Example:
# pve009:
#   ansible_host: 10.x.x.49
```
### B. Test Ansible Connectivity

```shell
# From your local machine in ansible directory
cd ~/8do/lab/ansible

# Ping test
ansible pve009 -i inventory/homelab.yml -m ping
# Should return: pong
```
### C. Run Ansible Playbook (Check Mode First)

```shell
# Dry run to see what will change
ansible-playbook -i inventory/homelab.yml site.yml --limit pve009 --ask-vault-pass --check --diff

# Review output carefully
```
### D. Apply Configuration

```shell
# Apply for real
ansible-playbook -i inventory/homelab.yml site.yml --limit pve009 --ask-vault-pass

# This will:
# - Configure repositories
# - Install packages
# - Configure Tailscale (update tags, setup SSH)
# - Request Tailscale TLS certificates for web UI
# - Configure NUT UPS client
```
### E. Verify Ansible Run

```shell
# Check Tailscale certificate
ssh root@pve009 "ls -la /etc/pve/local/pveproxy-ssl.*"

# Verify NUT client
ssh root@pve009 "systemctl status nut-monitor"
ssh root@pve009 "upsc myups@10.x.x.44"

# Verify web UI with Tailscale cert
# Open: https://pve009.your-tailnet.ts.net:8006
# Should show valid cert (no browser warning)
```
## Phase 8: Monitoring and Services

### A. Add to Netdata (If Using)

```shell
# Run Netdata playbook
ansible-playbook -i inventory/homelab.yml netdata_install.yml --limit pve009 --ask-vault-pass

# Verify in Netdata Cloud
# https://app.netdata.cloud/
```
### B. Verify UPS Monitoring

```shell
# On new node
upsc myups@10.x.x.44
# Should show UPS status

# Check logs
journalctl -u nut-monitor -f
```
### C. Update Documentation

## Phase 9: Final Verification

### Cluster Health Checklist

```shell
# Proxmox cluster
pvecm status   # Quorum, all nodes online
pvecm nodes    # All nodes listed

# Corosync
corosync-quorumtool -s   # Should show expected votes

# Ceph (if applicable)
ceph -s        # HEALTH_OK (after rebalancing)
ceph osd tree  # New OSDs visible and up

# Tailscale
tailscale status   # Connected, new node visible

# Services
systemctl status pveproxy
systemctl status pvedaemon
systemctl status pvestatd
systemctl status nut-monitor   # If UPS client
```
### Network Connectivity Tests

```shell
# From new node to cluster
ping -c 3 pve001
ping -c 3 10.x.x.44

# From local machine to new node
ping -c 3 10.x.x.XX
ping -c 3 pveXXX.your-tailnet.ts.net

# Test web UI
curl -k https://10.x.x.XX:8006
curl -k https://pveXXX.your-tailnet.ts.net:8006
```
### Test VM/Container Creation

```shell
# Download a template (if not already available)
pveam update
pveam available
pveam download local debian-12-standard_12.2-1_amd64.tar.zst

# Create test container (rootfs spec already names the storage)
pct create 999 local:vztmpl/debian-12-standard_12.2-1_amd64.tar.zst \
  --hostname test \
  --memory 512 \
  --net0 name=eth0,bridge=vmbr0,ip=dhcp \
  --rootfs local-lvm:8

# Start it
pct start 999

# Verify
pct status 999
pct exec 999 -- ping -c 3 google.com

# Clean up
pct stop 999
pct destroy 999
```
## Phase 10: Optional GPU Configuration

### A. Verify GPU Visibility (If Applicable)

```shell
# Check PCI devices
lspci | grep -i vga
lspci | grep -i nvidia
lspci | grep -i amd

# Check IOMMU groups
find /sys/kernel/iommu_groups/ -type l

# Load kernel modules
# For AMD
echo "options amdgpu si_support=1 cik_support=1" > /etc/modprobe.d/amdgpu.conf
update-initramfs -u

# For NVIDIA
# See specific GPU passthrough documentation
```
### B. Test GPU Passthrough (If Needed)

- Create test LXC with GPU device
- Verify device visibility inside container
- Test GPU workload (e.g., vainfo for VAAPI)

See: jellyfin-gpu-passthrough-lxc.md for detailed LXC GPU passthrough steps
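For the test LXC, device passthrough on PVE 8.2+ can use `devN:` entries in the container config. The device paths and group IDs below are assumptions for illustration — check `ls -l /dev/dri` and `/etc/group` on the node before copying:

```text
# /etc/pve/lxc/<CTID>.conf fragment (hypothetical example)
# gid=44 assumes the host "video" group; gid=104 assumes "render" — verify both
dev0: /dev/dri/card0,gid=44
dev1: /dev/dri/renderD128,gid=104
```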
## Rollback Plan

If something goes wrong during cluster join:

### Before Cluster Join

- Just reinstall Proxmox, start over

### After Cluster Join

Cluster join is irreversible! To remove:

- Follow the decommission runbook: proxmox-node-decommission-runbook.md
- Then reinstall and start over
## Common Issues

### Issue: Cluster join fails with “no quorum”

Solution:

```shell
# On existing node, check quorum state
pvecm status

# If quorum is lost, temporarily lower expected votes (restore afterwards)
pvecm expected 1
# Or set to the actual current node count
pvecm expected <current_nodes_count>
```
### Issue: Time sync issues during join

Solution:

```shell
# Ensure time is synchronized
systemctl restart systemd-timesyncd
timedatectl set-ntp true

# Verify times match across cluster
date && ssh pve001 date
```
### Issue: Corosync communication errors

Solution:

```shell
# Check firewall (should allow corosync ports)
# TCP/UDP: 5405-5412, 3121 (pmxcfs)

# Check corosync link status (corosync 3 uses unicast kronosnet, not multicast)
corosync-cfgtool -s
```
### Issue: Tailscale hostname conflict (pveXXX-1)

Prevention: Always delete old Tailscale entry before provisioning

Solution:

- Delete both entries in Tailscale admin
- Re-run the `tailscale up` command with the correct hostname
### Issue: Ceph won’t create OSDs

Common causes:

```shell
# Disk has existing partitions
wipefs -a /dev/sdX
sgdisk --zap-all /dev/sdX

# Disk is mounted
umount /dev/sdX*

# Disk has LVM signatures
pvremove /dev/sdX
```
### Issue: Ansible can’t connect after cluster join

Cause: SSH host keys changed during join

Solution:

```shell
# Remove old host key
ssh-keygen -R 10.x.x.XX
ssh-keygen -R pveXXX
ssh-keygen -R pveXXX.your-tailnet.ts.net

# Reconnect to accept new key
ssh root@10.x.x.XX
```
### Issue: NVIDIA GPU causes blank screen during install

Solution: At the boot menu, select Terminal UI, press `e`, add `nomodeset` to the `linux` line, then Ctrl+X to boot.

If the install still freezes, also add `initcall_blacklist=nvidiafb_init`.

After install, make it permanent in /etc/default/grub:

```text
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt nomodeset"
```
Then run `update-grub` and blacklist nouveau:

```shell
cat > /etc/modprobe.d/blacklist-nvidia.conf << 'EOF'
blacklist nouveau
blacklist nvidia
blacklist nvidiafb
EOF
update-initramfs -u
```
### Issue: VLAN traffic not passing on bridge-vlan-aware bridge

Symptom: VLAN interface (e.g. vmbr0.65) transmits frames but receives 0. Pings between nodes on VLAN fail.

Cause: `bridge-vlan-aware yes` is set but `bridge-vids` is missing. The bridge only passes VLAN 1 (default) on the physical port.

Verify:

```shell
bridge vlan show dev nic0
# If only VLAN 1 is listed, that's the problem
```
Solution: Add `bridge-vids` to vmbr0 in /etc/network/interfaces:

```text
auto vmbr0
iface vmbr0 inet static
    address 10.150.60.XX/24
    gateway 10.150.60.1
    bridge-ports nic0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 65 70
```

Then run `ifreload -a` and verify with `bridge vlan show dev nic0`.
### Issue: Ceph `pveceph mon create` fails with “Could not connect to ceph cluster”

Cause: On a fresh cluster, `pveceph mon create` tries to connect to monitors that don’t exist yet. The auto-bootstrap can fail if directories or keyrings are missing.

Solution — manual bootstrap:

```shell
# 1. Ensure directories exist
mkdir -p /var/lib/ceph/mon /var/lib/ceph/mgr
chown -R ceph:ceph /var/lib/ceph

# 2. Create monmap
monmaptool --create --add $(hostname) <storage-ip> \
  --fsid $(grep fsid /etc/pve/ceph.conf | awk '{print $3}') /tmp/monmap

# 3. Bootstrap monitor as root, then fix ownership
ceph-mon --mkfs -i $(hostname) --monmap /tmp/monmap \
  --keyring /etc/pve/priv/ceph.mon.keyring
chown -R ceph:ceph /var/lib/ceph/mon/ceph-$(hostname)

# 4. Start and verify
systemctl start ceph-mon@$(hostname)
ceph -s
```
### Issue: /etc/pve/priv/ceph.* keyrings deleted across cluster

Cause: /etc/pve/priv/ is shared via pmxcfs. Running `rm -f /etc/pve/priv/ceph.*` on ANY node deletes keyrings for ALL nodes.

Recovery: Extract from running monitor:

```shell
ceph --keyring /var/lib/ceph/mon/ceph-$(hostname)/keyring \
  --name mon. auth get client.admin -o /etc/pve/priv/ceph.client.admin.keyring
cp /var/lib/ceph/mon/ceph-$(hostname)/keyring /etc/pve/priv/ceph.mon.keyring
```
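Given that blast radius, it is worth keeping a node-local copy of the keyrings before any Ceph surgery (the backup path below is an assumption):

```shell
# /etc/pve is pmxcfs (cluster-wide); /root is node-local disk, so a copy
# there survives a cluster-wide delete under /etc/pve/priv.
backup_dir="/root/ceph-keyring-backup-$(date +%F)"
mkdir -p "$backup_dir"
cp -a /etc/pve/priv/ceph.* "$backup_dir"/ 2>/dev/null \
  && echo "keyrings copied to $backup_dir" \
  || echo "no keyrings found under /etc/pve/priv"
```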
## Post-Onboarding Tasks

### Short Term (Within Week)

### Long Term

## Appendix: Quick Reference Commands

```shell
# Cluster status
pvecm status
pvecm nodes
corosync-quorumtool -s

# Ceph status
ceph -s
ceph osd tree
ceph osd df
ceph health detail

# Tailscale
tailscale status
tailscale ping pve001

# Services
systemctl status pveproxy pvedaemon pvestatd
systemctl status nut-monitor

# Ansible
ansible pveXXX -i inventory/homelab.yml -m ping
ansible-playbook -i inventory/homelab.yml site.yml --limit pveXXX --ask-vault-pass --check --diff

# Network tests
ping 10.x.x.1
ping pve001
ping pveXXX.your-tailnet.ts.net
```
## Notes

- Timing: Full onboarding takes 2-4 hours depending on:
  - Ceph rebalancing time
  - Number of updates to install
  - GPU/hardware configuration complexity
- Planning: Schedule during maintenance window
- Backups: Ensure cluster backups are current before adding nodes
- Testing: Always test on one node before bulk operations
- Documentation: Keep this runbook updated with lessons learned

## Changelog

- 2026-02-28: Updated with pve01/pve02 rebuild lessons
  - Added BIOS settings checklist (IOMMU, SVM, Secure Boot, Above 4G Decoding)
  - Added NVIDIA nomodeset boot fix for installer
  - Added VLAN network config with bridge-vids requirement
  - Added Ceph manual bootstrap procedure
  - Added bridge-vlan-aware troubleshooting
- 2024-12-04: Initial runbook created
  - Added Tailscale hostname conflict prevention
  - Added Ansible automation steps
  - Added GPU configuration appendix
  - Added comprehensive verification checklists