Overview

Step-by-step guide for adding a new Proxmox node to the cluster, including hardware preparation, cluster join, Ceph configuration, Tailscale, and Ansible automation.

Prerequisites

  • Physical or virtual hardware ready
  • Network access to existing cluster nodes
  • Proxmox VE ISO (latest stable version)
  • Cluster network details (IPs, gateway, DNS)
  • Access to vault password for Ansible
  • Access to Tailscale admin console

Planning Phase

1. Determine Node Specifications

Hostname: pveXXX (e.g., pve009)
Management IP: 10.x.x.XX
Purpose: General compute / GPU workload / Storage / etc.
RAM: XXX GB
CPU: Model and core count
Storage: Disk configuration plan
GPU: If applicable

2. Network Planning

# Determine next available IP in cluster range
# Check existing nodes
ansible all -i inventory/homelab.yml -m shell -a "hostname -I | awk '{print \$1}'"

# Or check manually
for i in {40..60}; do ping -c 1 -W 1 10.x.x.$i &>/dev/null && echo "10.x.x.$i - UP" || echo "10.x.x.$i - available"; done

3. Check Cluster Health Before Adding

# On existing cluster node
pvecm status
ceph -s  # If using Ceph

# Ensure cluster is healthy before proceeding

Phase 1: Hardware Preparation

A. Physical Hardware Setup

  • Install hardware in rack/location
  • Connect power (note: consider UPS connectivity)
  • Connect network cables
  • Verify BIOS settings:
    • Virtualization enabled (VT-x/AMD-V)
    • IOMMU enabled (if doing GPU passthrough)
    • Boot order (network/USB/disk as needed)
    • Power settings (restore on AC power loss)
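
After the first boot into the installed system, these quick checks confirm the BIOS settings took effect (a minimal sketch; which lines appear depends on the CPU vendor):

# Virtualization extensions (should print vmx for Intel or svm for AMD)
grep -Eo 'vmx|svm' /proc/cpuinfo | sort -u

# IOMMU active (only matters if you plan GPU passthrough)
dmesg | grep -iE 'iommu|dmar|amd-vi'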

B. Create Installation Media

# Download latest Proxmox VE ISO
# https://www.proxmox.com/en/downloads

# Verify checksum against the value published on the download page
sha256sum proxmox-ve_*.iso   # on macOS: shasum -a 256 proxmox-ve_*.iso

# Create bootable USB (macOS) -- unmount the target disk first
diskutil unmountDisk /dev/diskX
sudo dd if=proxmox-ve_*.iso of=/dev/rdiskX bs=1m

# Or use balenaEtcher, Rufus, etc.

Phase 2: Proxmox Installation

A. Boot from Installation Media

  1. Boot from USB/DVD
  2. Select “Install Proxmox VE”
  3. Accept EULA

B. Target Disk Selection

  • Select installation disk (usually smallest SSD)
  • Important: Note which disks you’re saving for Ceph/storage
  • Consider ZFS for root disk if desired
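
If it's unclear which disk should hold the OS and which are reserved for Ceph, listing disks by size and model (from any shell on the node, e.g. the installer's debug console or a live environment) helps; a quick sketch:

# Identify disks by size/model/serial before picking the install target
lsblk -o NAME,SIZE,MODEL,SERIAL,TYPE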

C. Location and Time Zone

Country: United States (or appropriate)
Time zone: America/Los_Angeles (or appropriate)
Keyboard Layout: en-us

D. Password and Email

Root password: <use strong password, store in vault>
Email: your-email@example.com (or appropriate)

E. Network Configuration

Management Interface: Choose primary NIC (usually eno1, ens18, etc.)
Hostname (FQDN): pveXXX.localdomain (e.g., pve009.localdomain)
IP Address: 10.x.x.XX/24
Gateway: 10.x.x.1
DNS Server: 10.x.x.1 (or appropriate)

Important: Double-check IP is not in use!

F. Complete Installation

  • Review summary
  • Click “Install”
  • Wait for installation to complete (~10 minutes)
  • Remove installation media
  • Reboot

Phase 3: Post-Installation Setup

A. Initial Login and Validation

# SSH to new node
ssh root@10.x.x.XX

# Verify basic connectivity
ping -c 3 8.8.8.8
ping -c 3 google.com

# Check time sync (critical for cluster)
timedatectl status
# Should show: System clock synchronized: yes

# If not synced
systemctl enable --now systemd-timesyncd
timedatectl set-ntp true

B. Disable Enterprise Repositories (if no subscription)

# This is automated by Ansible, but good to verify manually first
cat /etc/apt/sources.list.d/pve-enterprise.list
cat /etc/apt/sources.list.d/ceph.list

# Enterprise entries should be commented out (or replaced with the
# no-subscription repos) if there is no subscription
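
If the enterprise repos are still active and you want to fix them by hand before Ansible runs, a minimal sketch (assumes Proxmox VE 8 on Debian bookworm; the Ansible role remains the source of truth):

# Comment out enterprise repos
sed -i 's/^deb/# deb/' /etc/apt/sources.list.d/pve-enterprise.list
sed -i 's/^deb/# deb/' /etc/apt/sources.list.d/ceph.list

# Add the no-subscription repo
echo "deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription" \
  > /etc/apt/sources.list.d/pve-no-subscription.list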

C. Update System

apt update
apt dist-upgrade -y

# May require reboot
reboot

D. Verify Web Interface Access

https://10.x.x.XX:8006

Login: root
Password: <root password>

Phase 4: Join Proxmox Cluster

Critical: Joining a cluster is irreversible without reinstalling. Make sure you’re ready!

A. Pre-Join Checks

On new node:

# Ensure no VMs or containers exist
qm list    # Should be empty
pct list   # Should be empty

# Verify time sync with cluster
ssh root@pve001 "date +%s" && date +%s
# Times should be within 1-2 seconds

# Verify cluster communication
ping -c 3 pve001
ssh root@pve001 "pvecm status"

B. Get Cluster Join Information

On existing cluster node:

pvecm status
# Note the cluster name

# Join information (fingerprint, link addresses) is also shown in the web UI
# under Datacenter -> Cluster -> Join Information; a CLI join only needs the
# IP of an existing node plus its root password

C. Join Cluster

On new node:

# Join cluster (use IP of existing node)
pvecm add 10.x.x.44

# You'll be prompted for:
# - Root password of existing node
# - Confirmation

# This will:
# - Configure corosync
# - Restart networking
# - Join cluster
# - Regenerate SSH host keys

Note: SSH connection will drop during join. Wait ~30 seconds, then reconnect.

D. Verify Cluster Membership

On any cluster node:

pvecm status
# Should show new node in member list

pvecm nodes
# Should list all nodes including new one

# Check quorum (expected votes should now include the new node)
corosync-quorumtool -s

On new node:

# Verify you can see cluster storage
pvesm status

# Verify you can see other nodes
pvesh get /nodes

Phase 5: Storage Configuration

A. Local Storage (Skip if Already Configured)

# Check existing storage
pvesm status

# Local storage should already exist from installation

B. Ceph OSD Setup (If Using Ceph)

Prerequisites:

  • At least 3 nodes in cluster for redundancy
  • Dedicated disks for Ceph (separate from OS disk)
  • Network connectivity to other Ceph nodes

On new node:

# Install Ceph packages (if not already)
pveceph install

# Create Ceph monitor (if cluster needs more)
pveceph mon create

# Create Ceph manager
pveceph mgr create

# List available disks
lsblk
ceph-volume lvm list   # shows disks already prepared as OSDs

# Create OSDs for each data disk (replace /dev/sdX)
pveceph osd create /dev/sdb
pveceph osd create /dev/sdc
pveceph osd create /dev/sdd

C. Verify Ceph Health

ceph -s
# Should show: HEALTH_OK (after rebalancing completes)

ceph osd tree
# Should show new OSDs in topology

ceph osd df
# Should show OSDs with appropriate size

Note: Ceph will rebalance data. This can take hours or days depending on cluster size.
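
To keep an eye on the rebalance without sitting on the console, a few commands worth knowing:

# Refresh overall status every 10 seconds
watch -n 10 ceph -s

# Per-pool recovery/client I/O rates and PG state summary
ceph osd pool stats
ceph pg stat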

Phase 6: Tailscale Setup

A. Manual Method (For Testing)

# On new node
curl -fsSL https://tailscale.com/install.sh | sh

# Authenticate (use auth key from Tailscale admin)
tailscale up --authkey=tskey-auth-XXXXX --hostname=pveXXX --advertise-tags=tag:proxmox --ssh

# Verify
tailscale status

B. Remove Old Tailscale Entry (If Hostname Existed)

  1. Go to https://login.tailscale.com/admin/machines
  2. Search for pveXXX
  3. Delete any old entries with same hostname
  4. Re-run tailscale up command above
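
To confirm the stale entry is actually gone before re-authenticating, a quick check from any other machine on the tailnet (hostname is a placeholder):

# Old entry should no longer appear; after re-join, exactly one entry should
tailscale status | grep -i pveXXX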

C. Verify Tailscale Connectivity

# From your local machine
ping pveXXX.your-tailnet.ts.net

# SSH via Tailscale
ssh root@pveXXX.your-tailnet.ts.net

# Test web UI
# Open: https://pveXXX.your-tailnet.ts.net:8006

Phase 7: Ansible Configuration

A. Add to Inventory

# Edit inventory file
vi ansible/inventory/homelab.yml

# Add new node under proxmox > nut_netclients > hosts
# Example:
#     pve009:
#       ansible_host: 10.x.x.49
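
As a sketch of what the resulting entry might look like (the nesting below mirrors the proxmox > nut_netclients > hosts structure mentioned above; the exact layout, including the children key, depends on how homelab.yml is organized):

# ansible/inventory/homelab.yml (excerpt, illustrative)
proxmox:
  children:
    nut_netclients:
      hosts:
        pve009:
          ansible_host: 10.x.x.49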

B. Test Ansible Connectivity

# From your local machine in ansible directory
cd ~/8do/lab/ansible

# Ping test
ansible pve009 -i inventory/homelab.yml -m ping

# Should return: pong

C. Run Ansible Playbook (Check Mode First)

# Dry run to see what will change
ansible-playbook -i inventory/homelab.yml site.yml --limit pve009 --ask-vault-pass --check --diff

# Review output carefully

D. Apply Configuration

# Apply for real
ansible-playbook -i inventory/homelab.yml site.yml --limit pve009 --ask-vault-pass

# This will:
# - Configure repositories
# - Install packages
# - Configure Tailscale (update tags, setup SSH)
# - Request Tailscale TLS certificates for web UI
# - Configure NUT UPS client

E. Verify Ansible Run

# Check Tailscale certificate
ssh root@pve009 "ls -la /etc/pve/local/pveproxy-ssl.*"

# Verify NUT client
ssh root@pve009 "systemctl status nut-monitor"
ssh root@pve009 "upsc myups@10.x.x.44"

# Verify web UI with Tailscale cert
# Open: https://pve009.your-tailnet.ts.net:8006
# Should show valid cert (no browser warning)

Phase 8: Monitoring and Services

A. Add to Netdata (If Using)

# Run Netdata playbook
ansible-playbook -i inventory/homelab.yml netdata_install.yml --limit pve009 --ask-vault-pass

# Verify in Netdata Cloud
# https://app.netdata.cloud/

B. Verify UPS Monitoring

# On new node
upsc myups@10.x.x.44

# Should show UPS status

# Check logs
journalctl -u nut-monitor -f

C. Update Documentation

  • Update capacity tracking spreadsheet
  • Update architecture diagrams
  • Document any special configuration for this node
  • Update monitoring dashboards

Phase 9: Final Verification

Cluster Health Checklist

# Proxmox cluster
pvecm status              # Quorum, all nodes online
pvecm nodes               # All nodes listed

# Corosync
corosync-quorumtool -s    # Should show expected votes

# Ceph (if applicable)
ceph -s                   # HEALTH_OK (after rebalancing)
ceph osd tree             # New OSDs visible and up

# Tailscale
tailscale status          # Connected, new node visible

# Services
systemctl status pveproxy
systemctl status pvedaemon
systemctl status pvestatd
systemctl status nut-monitor  # If UPS client

Network Connectivity Tests

# From new node to cluster
ping -c 3 pve001
ping -c 3 10.x.x.44

# From local machine to new node
ping -c 3 10.x.x.XX
ping -c 3 pveXXX.your-tailnet.ts.net

# Test web UI
curl -k https://10.x.x.XX:8006
curl -k https://pveXXX.your-tailnet.ts.net:8006

Test VM/Container Creation

# Download a template (if not already available)
pveam update
pveam available
pveam download local debian-12-standard_12.2-1_amd64.tar.zst

# Create test container
pct create 999 local:vztmpl/debian-12-standard_12.2-1_amd64.tar.zst \
  --hostname test \
  --memory 512 \
  --net0 name=eth0,bridge=vmbr0,ip=dhcp \
  --storage local-lvm \
  --rootfs local-lvm:8

# Start it
pct start 999

# Verify
pct status 999
pct exec 999 -- ping -c 3 google.com

# Clean up
pct stop 999
pct destroy 999

Phase 10: Optional GPU Configuration

A. Verify GPU Visibility (If Applicable)

# Check PCI devices
lspci | grep -i vga
lspci | grep -i nvidia
lspci | grep -i amd

# Check IOMMU groups
find /sys/kernel/iommu_groups/ -type l

# Load kernel modules
# For older AMD GCN 1.x cards (Southern/Sea Islands) that still default to radeon
echo "options amdgpu si_support=1 cik_support=1" > /etc/modprobe.d/amdgpu.conf
update-initramfs -u

# For NVIDIA
# See specific GPU passthrough documentation

B. Test GPU Passthrough (If Needed)

  • Create test LXC with GPU device
  • Verify device visibility inside container
  • Test GPU workload (e.g., vainfo for VAAPI)

See: jellyfin-gpu-passthrough-lxc.md for detailed LXC GPU passthrough steps
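
For orientation, the classic way to expose the host's /dev/dri render node to an LXC looks roughly like this (a sketch only; the container ID is a placeholder and the referenced doc has the full walkthrough):

# /etc/pve/lxc/<ctid>.conf (excerpt)
lxc.cgroup2.devices.allow: c 226:* rwm
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir

# Inside the container, verify VAAPI can see the device
vainfo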

Rollback Plan

If something goes wrong during cluster join:

Before Cluster Join

  • Just reinstall Proxmox, start over

After Cluster Join

Cluster join is irreversible! To remove:

  • Follow the decommission runbook: proxmox-node-decommission-runbook.md
  • Then reinstall and start over

Common Issues

Issue: Cluster join fails with “no quorum”

Solution:

# On an existing node, check quorum status
pvecm status
corosync-quorumtool -s

# If quorum was lost, temporarily lower expected votes (use with care)
pvecm expected <current_nodes_count>

Issue: Time sync issues during join

Solution:

# Ensure time is synchronized
systemctl restart systemd-timesyncd
timedatectl set-ntp true

# Verify times match across cluster
date && ssh pve001 date

Issue: Corosync communication errors

Solution:

# Check firewall (should allow corosync ports)
# UDP 5405-5412 (corosync/kronosnet)

# Check corosync link status (kronosnet uses unicast UDP, not multicast)
corosync-cfgtool -s

Issue: Tailscale hostname conflict (pveXXX-1)

Prevention: Always delete the old Tailscale entry before provisioning.

Solution:

  1. Delete both entries in Tailscale admin
  2. Re-run tailscale up command with correct hostname

Issue: Ceph won’t create OSDs

Common causes:

# Disk has existing partitions
wipefs -a /dev/sdX
sgdisk --zap-all /dev/sdX

# Disk is mounted
umount /dev/sdX*

# Disk has LVM signatures (e.g., from a previous OSD)
pvremove --force --force /dev/sdX
# Or zap everything in one go:
ceph-volume lvm zap /dev/sdX --destroy

Issue: Ansible can’t connect after cluster join

Cause: SSH host keys changed during the cluster join.

Solution:

# Remove old host key
ssh-keygen -R 10.x.x.XX
ssh-keygen -R pveXXX
ssh-keygen -R pveXXX.your-tailnet.ts.net

# Reconnect to accept new key
ssh root@10.x.x.XX

Post-Onboarding Tasks

Immediate (Same Day)

  • Monitor cluster health for 1-2 hours
  • Verify Ceph rebalancing progressing (if applicable)
  • Test VM migration to/from new node (see example below)
  • Verify backup jobs work with new node
  • Update capacity planning documents
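
A quick way to exercise migration, assuming an existing VM/CT ID you can safely move (IDs are placeholders; run qm migrate on the node currently hosting the guest):

# Live-migrate a VM to the new node, then back
qm migrate <vmid> pve009 --online
ssh root@pve009 "qm migrate <vmid> pve001 --online"

# Containers migrate with pct (use --restart unless the CT is stopped)
pct migrate <ctid> pve009 --restart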

Short Term (Within Week)

  • Migrate some production workloads to test
  • Monitor performance and stability
  • Document any special quirks or configurations
  • Add to monitoring dashboards
  • Review and update this runbook based on experience

Long Term

  • Include node in regular maintenance windows
  • Monitor hardware health (SMART, temperatures; see commands below)
  • Plan capacity utilization
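
For the hardware-health item above, a couple of commands worth running periodically (assumes smartmontools and lm-sensors are installed on the node):

# SMART health summary for each SATA disk
for d in /dev/sd?; do smartctl -H "$d"; done

# Temperatures (run sensors-detect once after installing lm-sensors)
sensors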

Appendix: Quick Reference Commands

# Cluster status
pvecm status
pvecm nodes
corosync-quorumtool -s

# Ceph status
ceph -s
ceph osd tree
ceph osd df
ceph health detail

# Tailscale
tailscale status
tailscale ping pve001

# Services
systemctl status pveproxy pvedaemon pvestatd
systemctl status nut-monitor

# Ansible
ansible pveXXX -i inventory/homelab.yml -m ping
ansible-playbook -i inventory/homelab.yml site.yml --limit pveXXX --ask-vault-pass --check --diff

# Network tests
ping 10.x.x.1
ping pve001
ping pveXXX.your-tailnet.ts.net

Notes

  • Timing: Full onboarding takes 2-4 hours depending on:
    • Ceph rebalancing time
    • Number of updates to install
    • GPU/hardware configuration complexity
  • Planning: Schedule during maintenance window
  • Backups: Ensure cluster backups are current before adding nodes
  • Testing: Always test on one node before bulk operations
  • Documentation: Keep this runbook updated with lessons learned

Changelog

  • 2024-12-04: Initial runbook created
    • Added Tailscale hostname conflict prevention
    • Added Ansible automation steps
    • Added GPU configuration appendix
    • Added comprehensive verification checklists