Overview

Step-by-step guide for adding a new Proxmox node to the cluster, including hardware preparation, cluster join, Ceph configuration, Tailscale, and Ansible automation.

Prerequisites

  • Physical or virtual hardware ready
  • Network access to existing cluster nodes
  • Proxmox VE ISO (latest stable version)
  • Cluster network details (IPs, gateway, DNS)
  • Access to vault password for Ansible
  • Access to Tailscale admin console

Planning Phase

1. Determine Node Specifications

Hostname: pveXXX (e.g., pve009)
Management IP: 10.x.x.XX
Purpose: General compute / GPU workload / Storage / etc.
RAM: XXX GB
CPU: Model and core count
Storage: Disk configuration plan
GPU: If applicable

2. Network Planning

# Determine next available IP in cluster range
# Check existing nodes
ansible all -i inventory/homelab.yml -m shell -a "hostname -I | awk '{print \$1}'"

# Or check manually
for i in {40..60}; do ping -c 1 -W 1 10.x.x.$i &>/dev/null && echo "10.x.x.$i - UP" || echo "10.x.x.$i - available"; done

3. Check Cluster Health Before Adding

# On existing cluster node
pvecm status
ceph -s  # If using Ceph

# Ensure cluster is healthy before proceeding

Phase 1: Hardware Preparation

A. Physical Hardware Setup

  • Install hardware in rack/location
  • Connect power (note: consider UPS connectivity)
  • Connect network cables
  • Verify BIOS settings:
    • Virtualization enabled (VT-x/AMD-V)
    • IOMMU enabled (if doing GPU passthrough)
    • Boot order (network/USB/disk as needed)
    • Power settings (restore on AC power loss)
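
After the first boot into the installed system, these quick checks confirm the BIOS settings took effect (a minimal sketch; which lines appear depends on the CPU vendor):

# Virtualization extensions (should print vmx for Intel or svm for AMD)
grep -Eo 'vmx|svm' /proc/cpuinfo | sort -u

# IOMMU active (only matters if you plan GPU passthrough)
dmesg | grep -iE 'iommu|dmar|amd-vi'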

B. Create Installation Media

# Download latest Proxmox VE ISO
# https://www.proxmox.com/en/downloads

# Verify checksum against the value published on the download page
sha256sum proxmox-ve_*.iso   # on macOS: shasum -a 256 proxmox-ve_*.iso

# Create bootable USB (macOS) -- unmount the target disk first
diskutil unmountDisk /dev/diskX
sudo dd if=proxmox-ve_*.iso of=/dev/rdiskX bs=1m

# Or use balenaEtcher, Rufus, etc.

Phase 2: Proxmox Installation

A. Boot from Installation Media

  1. Boot from USB/DVD
  2. Select “Install Proxmox VE”
  3. Accept EULA

B. Target Disk Selection

  • Select installation disk (usually smallest SSD)
  • Important: Note which disks you’re saving for Ceph/storage
  • Consider ZFS for root disk if desired
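
If it's unclear which disk should hold the OS and which are reserved for Ceph, listing disks by size and model (from any shell on the node, e.g. the installer's debug console or a live environment) helps; a quick sketch:

# Identify disks by size/model/serial before picking the install target
lsblk -o NAME,SIZE,MODEL,SERIAL,TYPE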

C. Location and Time Zone

Country: United States (or appropriate)
Time zone: America/Los_Angeles (or appropriate)
Keyboard Layout: en-us

D. Password and Email

Root password: <use strong password, store in vault>
Email: your-email@example.com (or appropriate)

E. Network Configuration

Management Interface: Choose primary NIC (usually eno1, ens18, etc.)
Hostname (FQDN): pveXXX.localdomain (e.g., pve009.localdomain)
IP Address: 10.x.x.XX/24
Gateway: 10.x.x.1
DNS Server: 10.x.x.1 (or appropriate)

Important: Double-check IP is not in use!

F. Complete Installation

  • Review summary
  • Click “Install”
  • Wait for installation to complete (~10 minutes)
  • Remove installation media
  • Reboot

Phase 3: Post-Installation Setup

A. Initial Login and Validation

# SSH to new node
ssh root@10.x.x.XX

# Verify basic connectivity
ping -c 3 8.8.8.8
ping -c 3 google.com

# Check time sync (critical for cluster)
timedatectl status
# Should show: System clock synchronized: yes

# If not synced
systemctl enable --now systemd-timesyncd
timedatectl set-ntp true

B. Disable Enterprise Repositories (if no subscription)

# This is automated by Ansible, but good to verify manually first
cat /etc/apt/sources.list.d/pve-enterprise.list
cat /etc/apt/sources.list.d/ceph.list

# Enterprise entries should be commented out (or replaced with the
# no-subscription repos) if there is no subscription
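
If the enterprise repos are still active and you want to fix them by hand before Ansible runs, a minimal sketch (assumes Proxmox VE 8 on Debian bookworm; the Ansible role remains the source of truth):

# Comment out enterprise repos
sed -i 's/^deb/# deb/' /etc/apt/sources.list.d/pve-enterprise.list
sed -i 's/^deb/# deb/' /etc/apt/sources.list.d/ceph.list

# Add the no-subscription repo
echo "deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription" \
  > /etc/apt/sources.list.d/pve-no-subscription.list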

C. Update System

apt update
apt dist-upgrade -y

# May require reboot
reboot

D. Verify Web Interface Access

https://10.x.x.XX:8006

Login: root
Password: <root password>

Phase 4: Join Proxmox Cluster

Critical: Joining a cluster is irreversible without reinstalling. Make sure you’re ready!

A. Pre-Join Checks

On new node:

# Ensure no VMs or containers exist
qm list    # Should be empty
pct list   # Should be empty

# Verify time sync with cluster
ssh root@pve001 "date +%s" && date +%s
# Times should be within 1-2 seconds

# Verify cluster communication
ping -c 3 pve001
ssh root@pve001 "pvecm status"

B. Get Cluster Join Information

On existing cluster node:

pvecm status
# Note the cluster name

# Join information (fingerprint, link addresses) is also shown in the web UI
# under Datacenter -> Cluster -> Join Information; a CLI join only needs the
# IP of an existing node plus its root password

C. Join Cluster

On new node:

# Join cluster (use IP of existing node)
pvecm add 10.x.x.44

# You'll be prompted for:
# - Root password of existing node
# - Confirmation

# This will:
# - Configure corosync
# - Restart networking
# - Join cluster
# - Regenerate SSH host keys

Note: SSH connection will drop during join. Wait ~30 seconds, then reconnect.

D. Verify Cluster Membership

On any cluster node:

pvecm status
# Should show new node in member list

pvecm nodes
# Should list all nodes including new one

# Check quorum (expected votes should now include the new node)
corosync-quorumtool -s

On new node:

# Verify you can see cluster storage
pvesm status

# Verify you can see other nodes
pvesh get /nodes

Phase 5: Storage Configuration

A. Local Storage (Skip if Already Configured)

# Check existing storage
pvesm status

# Local storage should already exist from installation

B. Ceph OSD Setup (If Using Ceph)

Prerequisites:

  • At least 3 nodes in cluster for redundancy
  • Dedicated disks for Ceph (separate from OS disk)
  • Network connectivity to other Ceph nodes

On new node:

# Install Ceph packages (if not already)
pveceph install

# Create Ceph monitor (if cluster needs more)
pveceph mon create

# Create Ceph manager
pveceph mgr create

# List available disks
lsblk
ceph-volume lvm list   # shows disks already prepared as OSDs

# Create OSDs for each data disk (replace /dev/sdX)
pveceph osd create /dev/sdb
pveceph osd create /dev/sdc
pveceph osd create /dev/sdd

C. Verify Ceph Health

ceph -s
# Should show: HEALTH_OK (after rebalancing completes)

ceph osd tree
# Should show new OSDs in topology

ceph osd df
# Should show OSDs with appropriate size

Note: Ceph will rebalance data. This can take hours or days depending on cluster size.
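
To keep an eye on the rebalance without sitting on the console, a few commands worth knowing:

# Refresh overall status every 10 seconds
watch -n 10 ceph -s

# Per-pool recovery/client I/O rates and PG state summary
ceph osd pool stats
ceph pg stat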

Phase 6: Tailscale Setup

A. Manual Method (For Testing)

# On new node
curl -fsSL https://tailscale.com/install.sh | sh

# Authenticate (use auth key from Tailscale admin)
tailscale up --authkey=tskey-auth-XXXXX --hostname=pveXXX --advertise-tags=tag:proxmox --ssh

# Verify
tailscale status

B. Remove Old Tailscale Entry (If Hostname Existed)

  1. Go to https://login.tailscale.com/admin/machines
  2. Search for pveXXX
  3. Delete any old entries with same hostname
  4. Re-run tailscale up command above
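
To confirm the stale entry is actually gone before re-authenticating, a quick check from any other machine on the tailnet (hostname is a placeholder):

# Old entry should no longer appear; after re-join, exactly one entry should
tailscale status | grep -i pveXXX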

C. Verify Tailscale Connectivity

# From your local machine
ping pveXXX.your-tailnet.ts.net

# SSH via Tailscale
ssh root@pveXXX.your-tailnet.ts.net

# Test web UI
# Open: https://pveXXX.your-tailnet.ts.net:8006

Phase 7: Ansible Configuration

A. Add to Inventory

# Edit inventory file
vi ansible/inventory/homelab.yml

# Add new node under proxmox > nut_netclients > hosts
# Example:
#     pve009:
#       ansible_host: 10.x.x.49
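
As a sketch of what the resulting entry might look like (the nesting below mirrors the proxmox > nut_netclients > hosts structure mentioned above; the exact layout, including the children key, depends on how homelab.yml is organized):

# ansible/inventory/homelab.yml (excerpt, illustrative)
proxmox:
  children:
    nut_netclients:
      hosts:
        pve009:
          ansible_host: 10.x.x.49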

B. Test Ansible Connectivity

# From your local machine in ansible directory
cd ~/8do/lab/ansible

# Ping test
ansible pve009 -i inventory/homelab.yml -m ping

# Should return: pong

C. Run Ansible Playbook (Check Mode First)

# Dry run to see what will change
ansible-playbook -i inventory/homelab.yml site.yml --limit pve009 --ask-vault-pass --check --diff

# Review output carefully

D. Apply Configuration

# Apply for real
ansible-playbook -i inventory/homelab.yml site.yml --limit pve009 --ask-vault-pass

# This will:
# - Configure repositories
# - Install packages
# - Configure Tailscale (update tags, setup SSH)
# - Request Tailscale TLS certificates for web UI
# - Configure NUT UPS client

E. Verify Ansible Run

# Check Tailscale certificate
ssh root@pve009 "ls -la /etc/pve/local/pveproxy-ssl.*"

# Verify NUT client
ssh root@pve009 "systemctl status nut-monitor"
ssh root@pve009 "upsc myups@10.x.x.44"

# Verify web UI with Tailscale cert
# Open: https://pve009.your-tailnet.ts.net:8006
# Should show valid cert (no browser warning)

Phase 8: Monitoring and Services

A. Add to Netdata (If Using)

# Run Netdata playbook
ansible-playbook -i inventory/homelab.yml netdata_install.yml --limit pve009 --ask-vault-pass

# Verify in Netdata Cloud
# https://app.netdata.cloud/

B. Verify UPS Monitoring

# On new node
upsc myups@10.x.x.44

# Should show UPS status

# Check logs
journalctl -u nut-monitor -f

C. Update Documentation

  • Update capacity tracking spreadsheet
  • Update architecture diagrams
  • Document any special configuration for this node
  • Update monitoring dashboards

Phase 9: Final Verification

Cluster Health Checklist

# Proxmox cluster
pvecm status              # Quorum, all nodes online
pvecm nodes               # All nodes listed

# Corosync
corosync-quorumtool -s    # Should show expected votes

# Ceph (if applicable)
ceph -s                   # HEALTH_OK (after rebalancing)
ceph osd tree             # New OSDs visible and up

# Tailscale
tailscale status          # Connected, new node visible

# Services
systemctl status pveproxy
systemctl status pvedaemon
systemctl status pvestatd
systemctl status nut-monitor  # If UPS client

Network Connectivity Tests

# From new node to cluster
ping -c 3 pve001
ping -c 3 10.x.x.44

# From local machine to new node
ping -c 3 10.x.x.XX
ping -c 3 pveXXX.your-tailnet.ts.net

# Test web UI
curl -k https://10.x.x.XX:8006
curl -k https://pveXXX.your-tailnet.ts.net:8006

Test VM/Container Creation

# Download a template (if not already available)
pveam update
pveam available
pveam download local debian-12-standard_12.2-1_amd64.tar.zst

# Create test container
pct create 999 local:vztmpl/debian-12-standard_12.2-1_amd64.tar.zst \
  --hostname test \
  --memory 512 \
  --net0 name=eth0,bridge=vmbr0,ip=dhcp \
  --storage local-lvm \
  --rootfs local-lvm:8

# Start it
pct start 999

# Verify
pct status 999
pct exec 999 -- ping -c 3 google.com

# Clean up
pct stop 999
pct destroy 999

Phase 10: Optional GPU Configuration

A. Verify GPU Visibility (If Applicable)

# Check PCI devices
lspci | grep -i vga
lspci | grep -i nvidia
lspci | grep -i amd

# Check IOMMU groups
find /sys/kernel/iommu_groups/ -type l

# Load kernel modules
# For older AMD GCN 1.x cards (Southern/Sea Islands) that still default to radeon
echo "options amdgpu si_support=1 cik_support=1" > /etc/modprobe.d/amdgpu.conf
update-initramfs -u

# For NVIDIA
# See specific GPU passthrough documentation

B. Test GPU Passthrough (If Needed)

  • Create test LXC with GPU device
  • Verify device visibility inside container
  • Test GPU workload (e.g., vainfo for VAAPI)

See: jellyfin-gpu-passthrough-lxc.md for detailed LXC GPU passthrough steps
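
For orientation, the classic way to expose the host's /dev/dri render node to an LXC looks roughly like this (a sketch only; the container ID is a placeholder and the referenced doc has the full walkthrough):

# /etc/pve/lxc/<ctid>.conf (excerpt)
lxc.cgroup2.devices.allow: c 226:* rwm
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir

# Inside the container, verify VAAPI can see the device
vainfo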

Rollback Plan

If something goes wrong during cluster join:

Before Cluster Join

  • Just reinstall Proxmox, start over

After Cluster Join

Cluster join is irreversible! To remove:

  • Follow the decommission runbook: proxmox-node-decommission-runbook.md
  • Then reinstall and start over

Common Issues

Issue: Cluster join fails with “no quorum”

Solution:

# On an existing node, check quorum status
pvecm status
corosync-quorumtool -s

# If quorum was lost, temporarily lower expected votes (use with care)
pvecm expected <current_nodes_count>

Issue: Time sync issues during join

Solution:

# Ensure time is synchronized
systemctl restart systemd-timesyncd
timedatectl set-ntp true

# Verify times match across cluster
date && ssh pve001 date

Issue: Corosync communication errors

Solution:

# Check firewall (should allow corosync ports)
# UDP 5405-5412 (corosync/kronosnet)

# Check corosync link status (kronosnet uses unicast UDP, not multicast)
corosync-cfgtool -s

Issue: Tailscale hostname conflict (pveXXX-1)

Prevention: Always delete the old Tailscale entry before provisioning.

Solution:

  1. Delete both entries in Tailscale admin
  2. Re-run tailscale up command with correct hostname

Issue: Ceph won’t create OSDs

Common causes:

# Disk has existing partitions
wipefs -a /dev/sdX
sgdisk --zap-all /dev/sdX

# Disk is mounted
umount /dev/sdX*

# Disk has LVM signatures (e.g., from a previous OSD)
pvremove --force --force /dev/sdX
# Or zap everything in one go:
ceph-volume lvm zap /dev/sdX --destroy

Issue: Ansible can’t connect after cluster join

Cause: SSH host keys changed during the cluster join.

Solution:

# Remove old host key
ssh-keygen -R 10.x.x.XX
ssh-keygen -R pveXXX
ssh-keygen -R pveXXX.your-tailnet.ts.net

# Reconnect to accept new key
ssh root@10.x.x.XX

Post-Onboarding Tasks

Immediate (Same Day)

  • Monitor cluster health for 1-2 hours
  • Verify Ceph rebalancing progressing (if applicable)
  • Test VM migration to/from new node (see example below)
  • Verify backup jobs work with new node
  • Update capacity planning documents
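
A quick way to exercise migration, assuming an existing VM/CT ID you can safely move (IDs are placeholders; run qm migrate on the node currently hosting the guest):

# Live-migrate a VM to the new node, then back
qm migrate <vmid> pve009 --online
ssh root@pve009 "qm migrate <vmid> pve001 --online"

# Containers migrate with pct (use --restart unless the CT is stopped)
pct migrate <ctid> pve009 --restart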

Short Term (Within Week)

  • Migrate some production workloads to test
  • Monitor performance and stability
  • Document any special quirks or configurations
  • Add to monitoring dashboards
  • Review and update this runbook based on experience

Long Term

  • Include node in regular maintenance windows
  • Monitor hardware health (SMART, temperatures; see commands below)
  • Plan capacity utilization
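
For the hardware-health item above, a couple of commands worth running periodically (assumes smartmontools and lm-sensors are installed on the node):

# SMART health summary for each SATA disk
for d in /dev/sd?; do smartctl -H "$d"; done

# Temperatures (run sensors-detect once after installing lm-sensors)
sensors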

Appendix: Quick Reference Commands

# Cluster status
pvecm status
pvecm nodes
corosync-quorumtool -s

# Ceph status
ceph -s
ceph osd tree
ceph osd df
ceph health detail

# Tailscale
tailscale status
tailscale ping pve001

# Services
systemctl status pveproxy pvedaemon pvestatd
systemctl status nut-monitor

# Ansible
ansible pveXXX -i inventory/homelab.yml -m ping
ansible-playbook -i inventory/homelab.yml site.yml --limit pveXXX --ask-vault-pass --check --diff

# Network tests
ping 10.x.x.1
ping pve001
ping pveXXX.your-tailnet.ts.net

Notes

  • Timing: Full onboarding takes 2-4 hours depending on:
    • Ceph rebalancing time
    • Number of updates to install
    • GPU/hardware configuration complexity
  • Planning: Schedule during maintenance window
  • Backups: Ensure cluster backups are current before adding nodes
  • Testing: Always test on one node before bulk operations
  • Documentation: Keep this runbook updated with lessons learned

Changelog

  • 2024-12-04: Initial runbook created
    • Added Tailscale hostname conflict prevention
    • Added Ansible automation steps
    • Added GPU configuration appendix
    • Added comprehensive verification checklists