Overview#
Step-by-step guide for adding a new Proxmox node to the cluster, including hardware preparation, cluster join, Ceph configuration, Tailscale, and Ansible automation.
Prerequisites#
- Physical or virtual hardware ready
- Network access to existing cluster nodes
- Proxmox VE ISO (latest stable version)
- Cluster network details (IPs, gateway, DNS)
- Access to vault password for Ansible
- Access to Tailscale admin console
Planning Phase#
1. Determine Node Specifications#
```
Hostname: pveXXX (e.g., pve009)
Management IP: 10.x.x.XX
Purpose: General compute / GPU workload / Storage / etc.
RAM: XXX GB
CPU: Model and core count
Storage: Disk configuration plan
GPU: If applicable
```
2. Network Planning#
```bash
# Determine next available IP in cluster range
# Check existing nodes
ansible all -i inventory/homelab.yml -m shell -a "hostname -I | awk '{print \$1}'"

# Or check manually
for i in {40..60}; do ping -c 1 -W 1 10.x.x.$i &>/dev/null && echo "10.x.x.$i - UP" || echo "10.x.x.$i - available"; done
```
3. Check Cluster Health Before Adding#
```bash
# On existing cluster node
pvecm status
ceph -s   # If using Ceph

# Ensure cluster is healthy before proceeding
```
Phase 1: Hardware Preparation#
A. Physical Hardware Setup#
```bash
# Download latest Proxmox VE ISO
# https://www.proxmox.com/en/downloads

# Verify checksum
sha256sum proxmox-ve_*.iso

# Create bootable USB (macOS) - unmount the target disk first or dd will fail
diskutil unmountDisk /dev/diskX
sudo dd if=proxmox-ve_*.iso of=/dev/diskX bs=1m status=progress

# Or use balenaEtcher, Rufus, etc.
```
Phase 2: Proxmox Installation#
A. Boot Installer#
- Boot from USB/DVD
- Select “Install Proxmox VE”
- Accept EULA
B. Target Disk Selection#
- Select installation disk (usually smallest SSD)
- Important: Note which disks you’re saving for Ceph/storage
- Consider ZFS for root disk if desired
C. Location and Time Zone#
```
Country: United States (or appropriate)
Time zone: America/Los_Angeles (or appropriate)
Keyboard Layout: en-us
```
D. Password and Email#
```
Root password: <use strong password, store in vault>
Email: your-email@example.com (or appropriate)
```
E. Network Configuration#
```
Management Interface: Choose primary NIC (usually eno1, ens18, etc.)
Hostname (FQDN): pveXXX.localdomain (e.g., pve009.localdomain)
IP Address: 10.x.x.XX/24
Gateway: 10.x.x.1
DNS Server: 10.x.x.1 (or appropriate)
```
Important: Double-check IP is not in use!
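A quick way to confirm this from any machine on the management subnet (10.x.x.XX is the placeholder address from the planning step):

```bash
# Expect no reply if the address is actually free
ping -c 2 -W 1 10.x.x.XX && echo "WARNING: 10.x.x.XX answers - already in use" || echo "10.x.x.XX appears free"

# Optional: an ARP sweep catches hosts that drop ICMP (requires arp-scan)
# arp-scan --localnet | grep 10.x.x.XX
```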
F. Complete Installation#
- Review summary
- Click “Install”
- Wait for installation to complete (~10 minutes)
- Remove installation media
- Reboot
Phase 3: Post-Installation Setup#
A. Initial Login and Validation#
```bash
# SSH to new node
ssh root@10.x.x.XX

# Verify basic connectivity
ping -c 3 8.8.8.8
ping -c 3 google.com

# Check time sync (critical for cluster)
timedatectl status
# Should show: System clock synchronized: yes

# If not synced
systemctl enable --now systemd-timesyncd
timedatectl set-ntp true
```
B. Disable Enterprise Repositories (if no subscription)#
```bash
# This is automated by Ansible, but good to verify manually first
cat /etc/apt/sources.list.d/pve-enterprise.list
cat /etc/apt/sources.list.d/ceph.list
# Entries should be commented out or point to the no-subscription repos
```
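If Ansible has not run yet and the enterprise lists are still active, a minimal manual fix might look like this (a sketch assuming Proxmox VE 8 on Debian bookworm; adjust the suite and Ceph release to match your install):

```bash
# Comment out the enterprise repository entries
sed -i 's/^deb/# deb/' /etc/apt/sources.list.d/pve-enterprise.list
sed -i 's/^deb/# deb/' /etc/apt/sources.list.d/ceph.list

# Add the no-subscription repository (bookworm assumed)
echo "deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription" \
  > /etc/apt/sources.list.d/pve-no-subscription.list

apt update
```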
C. Update System#
```bash
apt update
apt dist-upgrade -y

# May require reboot
reboot
```
D. Verify Web Interface Access#
```
https://10.x.x.XX:8006
Login: root
Password: <root password>
```
Phase 4: Join Proxmox Cluster#
Critical: Joining a cluster is irreversible without reinstalling. Make sure you’re ready!
A. Pre-Join Checks#
On new node:
```bash
# Ensure no VMs or containers exist
qm list    # Should be empty
pct list   # Should be empty

# Verify time sync with cluster
ssh root@pve001 "date +%s" && date +%s
# Times should be within 1-2 seconds

# Verify cluster communication
ping -c 3 pve001
ssh root@pve001 "pvecm status"
```
B. Gather Join Information#
On existing cluster node:

```bash
pvecm status
# Note the cluster name

# Get join information
pvecm add --help
```
C. Join Cluster#
On new node:
```bash
# Join cluster (use IP of existing node)
pvecm add 10.x.x.44

# You'll be prompted for:
# - Root password of existing node
# - Confirmation

# This will:
# - Configure corosync
# - Restart networking
# - Join cluster
# - Regenerate SSH host keys
```
Note: SSH connection will drop during join. Wait ~30 seconds, then reconnect.
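If you prefer to script the wait instead of retrying by hand, a small loop like the following works (host keys are deliberately ignored here because the join regenerates them; clean up known_hosts afterwards as described under Common Issues):

```bash
# Poll until SSH answers again after the join restarts networking
until ssh -o ConnectTimeout=2 -o StrictHostKeyChecking=no \
          -o UserKnownHostsFile=/dev/null \
          root@10.x.x.XX true 2>/dev/null; do
  echo "waiting for node to come back..."
  sleep 5
done
```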
D. Verify Cluster Membership#
On any cluster node:
```bash
pvecm status
# Should show new node in member list

pvecm nodes
# Should list all nodes including new one

# Check quorum
corosync-quorumtool -s
# Should show expected votes and quorum info
```
On new node:
```bash
# Verify you can see cluster storage
pvesm status

# Verify you can see other nodes
pvesh get /nodes
```
Phase 5: Storage Configuration#
A. Verify Local Storage#

```bash
# Check existing storage
pvesm status
# Local storage should already exist from installation
```
B. Ceph OSD Setup (If Using Ceph)#
Prerequisites:
- At least 3 nodes in cluster for redundancy
- Dedicated disks for Ceph (separate from OS disk)
- Network connectivity to other Ceph nodes
On new node:
```bash
# Install Ceph packages (if not already)
pveceph install

# Create Ceph monitor (if cluster needs more)
pveceph mon create

# Create Ceph manager
pveceph mgr create

# List available disks
lsblk
pveceph osd list

# Create OSDs for each data disk (replace /dev/sdX)
pveceph osd create /dev/sdb
pveceph osd create /dev/sdc
pveceph osd create /dev/sdd
```
C. Verify Ceph Health#
```bash
ceph -s
# Should show: HEALTH_OK (after rebalancing completes)

ceph osd tree
# Should show new OSDs in topology

ceph osd df
# Should show OSDs with appropriate size
```
Note: Ceph will rebalance data. This can take hours or days depending on cluster size.
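To watch the rebalance without retyping commands, either of these works:

```bash
# Stream cluster events as recovery progresses
ceph -w

# Or poll a summary every 30 seconds
watch -n 30 'ceph -s; echo; ceph osd df'
```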
Phase 6: Tailscale Setup#
A. Manual Method (For Testing)#
```bash
# On new node
curl -fsSL https://tailscale.com/install.sh | sh

# Authenticate (use auth key from Tailscale admin)
tailscale up --authkey=tskey-auth-XXXXX --hostname=pveXXX --advertise-tags=tag:proxmox --ssh

# Verify
tailscale status
```
B. Remove Old Tailscale Entry (If Hostname Existed)#
- Go to https://login.tailscale.com/admin/machines
- Search for pveXXX
- Delete any old entries with the same hostname
- Re-run the tailscale up command above
C. Verify Tailscale Connectivity#
```bash
# From your local machine
ping pveXXX.your-tailnet.ts.net

# SSH via Tailscale
ssh root@pveXXX.your-tailnet.ts.net

# Test web UI
# Open: https://pveXXX.your-tailnet.ts.net:8006
```
Phase 7: Ansible Configuration#
A. Add to Inventory#
```bash
# Edit inventory file
vi ansible/inventory/homelab.yml

# Add new node under proxmox > nut_netclients > hosts
# Example:
#   pve009:
#     ansible_host: 10.x.x.49
```
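Before running any playbooks, it can help to confirm Ansible actually parses the new entry (pve009 and the inventory path are the examples used throughout this runbook):

```bash
# Show the variables Ansible resolves for the new host
ansible-inventory -i inventory/homelab.yml --host pve009

# Confirm it lands in the expected group
ansible-inventory -i inventory/homelab.yml --graph | grep -B2 -A2 pve009
```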
B. Test Ansible Connectivity#
```bash
# From your local machine in ansible directory
cd ~/8do/lab/ansible

# Ping test
ansible pve009 -i inventory/homelab.yml -m ping
# Should return: pong
```
C. Run Ansible Playbook (Check Mode First)#
```bash
# Dry run to see what will change
ansible-playbook -i inventory/homelab.yml site.yml --limit pve009 --ask-vault-pass --check --diff

# Review output carefully
```
D. Apply Configuration#
```bash
# Apply for real
ansible-playbook -i inventory/homelab.yml site.yml --limit pve009 --ask-vault-pass

# This will:
# - Configure repositories
# - Install packages
# - Configure Tailscale (update tags, setup SSH)
# - Request Tailscale TLS certificates for web UI
# - Configure NUT UPS client
```
E. Verify Ansible Run#
```bash
# Check Tailscale certificate
ssh root@pve009 "ls -la /etc/pve/local/pveproxy-ssl.*"

# Verify NUT client
ssh root@pve009 "systemctl status nut-monitor"
ssh root@pve009 "upsc myups@10.x.x.44"

# Verify web UI with Tailscale cert
# Open: https://pve009.your-tailnet.ts.net:8006
# Should show valid cert (no browser warning)
```
Phase 8: Monitoring and Services#
A. Add to Netdata (If Using)#
```bash
# Run Netdata playbook
ansible-playbook -i inventory/homelab.yml netdata_install.yml --limit pve009 --ask-vault-pass

# Verify in Netdata Cloud
# https://app.netdata.cloud/
```
B. Verify UPS Monitoring#
```bash
# On new node
upsc myups@10.x.x.44
# Should show UPS status

# Check logs
journalctl -u nut-monitor -f
```
C. Update Documentation#
Phase 9: Final Verification#
Cluster Health Checklist#
```bash
# Proxmox cluster
pvecm status            # Quorum, all nodes online
pvecm nodes             # All nodes listed

# Corosync
corosync-quorumtool -s  # Should show expected votes

# Ceph (if applicable)
ceph -s                 # HEALTH_OK (after rebalancing)
ceph osd tree           # New OSDs visible and up

# Tailscale
tailscale status        # Connected, new node visible

# Services
systemctl status pveproxy
systemctl status pvedaemon
systemctl status pvestatd
systemctl status nut-monitor   # If UPS client
```
Network Connectivity Tests#
```bash
# From new node to cluster
ping -c 3 pve001
ping -c 3 10.x.x.44

# From local machine to new node
ping -c 3 10.x.x.XX
ping -c 3 pveXXX.your-tailnet.ts.net

# Test web UI
curl -k https://10.x.x.XX:8006
curl -k https://pveXXX.your-tailnet.ts.net:8006
```
Test VM/Container Creation#
```bash
# Download a template (if not already available)
pveam update
pveam available
pveam download local debian-12-standard_12.2-1_amd64.tar.zst

# Create test container
pct create 999 local:vztmpl/debian-12-standard_12.2-1_amd64.tar.zst \
  --hostname test \
  --memory 512 \
  --net0 name=eth0,bridge=vmbr0,ip=dhcp \
  --storage local-lvm \
  --rootfs local-lvm:8

# Start it
pct start 999

# Verify
pct status 999
pct exec 999 -- ping -c 3 google.com

# Clean up
pct stop 999
pct destroy 999
```
Phase 10: Optional GPU Configuration#
A. Verify GPU Visibility (If Applicable)#
```bash
# Check PCI devices
lspci | grep -i vga
lspci | grep -i nvidia
lspci | grep -i amd

# Check IOMMU groups
find /sys/kernel/iommu_groups/ -type l

# Load kernel modules
# For AMD
echo "options amdgpu si_support=1 cik_support=1" > /etc/modprobe.d/amdgpu.conf
update-initramfs -u

# For NVIDIA
# See specific GPU passthrough documentation
```
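If the IOMMU groups listing above comes back empty, IOMMU is probably not enabled on the kernel command line. A minimal sketch for a GRUB-booted Intel node (adjust for systemd-boot, or for AMD where amd_iommu is on by default):

```bash
# Add IOMMU flags to the kernel command line
sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="quiet"/GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"/' /etc/default/grub
update-grub
reboot

# After the reboot, confirm IOMMU is active
dmesg | grep -e DMAR -e IOMMU
```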
B. Test GPU Passthrough (If Needed)#
- Create test LXC with GPU device
- Verify device visibility inside container
- Test GPU workload (e.g., vainfo for VAAPI)
See: jellyfin-gpu-passthrough-lxc.md for detailed LXC GPU passthrough steps
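As a minimal illustration of the approach (the referenced runbook has the complete procedure), bind-mounting an Intel/AMD iGPU's /dev/dri into a test LXC might look like this; CTID 999 and the device paths are assumptions:

```bash
# Append GPU device access to the container config
cat >> /etc/pve/lxc/999.conf <<'EOF'
lxc.cgroup2.devices.allow: c 226:* rwm
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
EOF

# Restart the container and check the device from inside
pct stop 999 && pct start 999
pct exec 999 -- ls -la /dev/dri
pct exec 999 -- vainfo   # requires vainfo / VA drivers installed in the container
```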
Rollback Plan#
If something goes wrong during cluster join:
Before Cluster Join#
- Just reinstall Proxmox, start over
After Cluster Join#
Cluster join is irreversible! To remove:
- Follow the decommission runbook: proxmox-node-decommission-runbook.md
- Then reinstall and start over
Common Issues#
Issue: Cluster join fails with “no quorum”#
Solution:
```bash
# On existing node, check quorum
pvecm status

# If quorum is lost, temporarily lower the expected votes
# (e.g., to the number of nodes currently online)
pvecm expected <current_nodes_count>
```
Issue: Time sync issues during join#
Solution:
```bash
# Ensure time is synchronized
systemctl restart systemd-timesyncd
timedatectl set-ntp true

# Verify times match across cluster
date && ssh pve001 date
```
Issue: Corosync communication errors#
Solution:
```bash
# Check firewall (corosync traffic must be allowed: UDP ports 5405-5412)

# Check corosync link status
corosync-cfgtool -s
```
Issue: Tailscale hostname conflict (pveXXX-1)#
Prevention: Always delete old Tailscale entry before provisioning
Solution:
- Delete both entries in Tailscale admin
- Re-run the tailscale up command with the correct hostname
Issue: Ceph won’t create OSDs#
Common causes:
```bash
# Disk has existing partitions
wipefs -a /dev/sdX
sgdisk --zap-all /dev/sdX

# Disk is mounted
umount /dev/sdX*

# Disk has LVM signatures
pvremove /dev/sdX
```
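If the disk still refuses after the cleanup above, Ceph's own zap usually clears leftover LVM/bluestore metadata (/dev/sdX is the same placeholder as above):

```bash
# Destroy leftover Ceph/LVM metadata, then retry OSD creation
ceph-volume lvm zap /dev/sdX --destroy
pveceph osd create /dev/sdX
```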
Issue: Ansible can’t connect after cluster join#
Cause: SSH host keys changed during join
Solution:
```bash
# Remove old host keys
ssh-keygen -R 10.x.x.XX
ssh-keygen -R pveXXX
ssh-keygen -R pveXXX.your-tailnet.ts.net

# Reconnect to accept new key
ssh root@10.x.x.XX
```
Post-Onboarding Tasks#
Short Term (Within Week)#
Long Term#
Appendix: Quick Reference Commands#
```bash
# Cluster status
pvecm status
pvecm nodes
corosync-quorumtool -s

# Ceph status
ceph -s
ceph osd tree
ceph osd df
ceph health detail

# Tailscale
tailscale status
tailscale ping pve001

# Services
systemctl status pveproxy pvedaemon pvestatd
systemctl status nut-monitor

# Ansible
ansible pveXXX -i inventory/homelab.yml -m ping
ansible-playbook -i inventory/homelab.yml site.yml --limit pveXXX --ask-vault-pass --check --diff

# Network tests
ping 10.x.x.1
ping pve001
ping pveXXX.your-tailnet.ts.net
```
Notes#
- Timing: Full onboarding takes 2-4 hours depending on:
  - Ceph rebalancing time
  - Number of updates to install
  - GPU/hardware configuration complexity
- Planning: Schedule during maintenance window
- Backups: Ensure cluster backups are current before adding nodes
- Testing: Always test on one node before bulk operations
- Documentation: Keep this runbook updated with lessons learned
Changelog#
- 2024-12-04: Initial runbook created
  - Added Tailscale hostname conflict prevention
  - Added Ansible automation steps
  - Added GPU configuration appendix
  - Added comprehensive verification checklists