How to Add Server and Agent Nodes to an Existing k3s Cluster

Written by: Bagus Facsi Aginsa
Published at: 09 Dec 2023


Your k3s cluster has been running fine on a single machine. Then the host goes down: a kernel panic, a dead disk, a cloud provider maintenance window that ran longer than expected. Every workload stops. You bring the node back and everything recovers, but the question that lingers is: what if it had been unrecoverable? What if it had taken hours instead of minutes?

Adding nodes is how you fix this. A second or third server node gives the control plane the ability to survive a single machine failure. Scheduling continues, the API server stays reachable, running pods keep running. Agent nodes let you spread workloads across machines so one overloaded host cannot take down your entire cluster.

By the end of this tutorial you will have joined a new server node to an existing k3s cluster (extending the control plane for high availability) and joined a new agent node (adding pod capacity). You will also know what the common join failures look like and how to fix them.


Server node vs agent node: which one do you need?

A server node runs the k3s control plane: the API server, scheduler, controller manager, and the datastore client. Every cluster needs at least one. Adding more server nodes makes the control plane highly available: if one server node fails, the others continue handling requests and managing workloads. How many you need depends on the datastore. With embedded etcd, quorum requires a minimum of three server nodes, because a two-member etcd cluster loses its majority as soon as either member fails. With an external datastore such as the MySQL database used in this tutorial, the servers do not form a quorum among themselves, so two server nodes already survive a single crash, provided the database itself stays available.

An agent node runs the kubelet and kube-proxy only. It does not participate in control plane decisions. Its job is to receive pod scheduling assignments from the server and run them. Add agent nodes when your existing nodes are running out of CPU or memory for pod workloads.

Pick a server node if:

  • Your cluster has a single server node today and you want the control plane to survive a host failure
  • You need the cluster API to stay reachable if one machine goes offline
  • You are building toward a three-node HA setup

Pick an agent node if:

  • Your control plane is already redundant and you need more pod capacity
  • You want to separate control plane machines from worker machines
  • Your existing nodes are hitting CPU or memory limits during scheduling

Why multi-server k3s needs a shared datastore

When you run more than one k3s server node, all of them need to read and write the same cluster state. That state lives in a datastore.

k3s supports two options. Embedded etcd (--cluster-init on the first server node) is the simpler choice: k3s manages its own etcd cluster across your server nodes with no external dependency. An external datastore (MySQL, PostgreSQL, or an external etcd cluster passed via --datastore-endpoint) separates the storage layer from the compute layer, making it easier to back up and manage with your existing database tooling.

The cluster in this tutorial was initialized with an external MySQL datastore. If you have not set that up yet, start with How To Install k3s With Calico And External MySQL Database first. The server node join requires every new server to reach that same datastore. Agent nodes do not connect to the datastore at all.


Prerequisites

  1. Ubuntu 20.04 or later on the new node
  2. An existing k3s cluster with at least one server node running
  3. The new node can reach the k3s API endpoint on port 6443
  4. For a new server node: the new node can also reach the external datastore (port 3306 for the MySQL setup used here)
  5. No firewall blocking port 6443 between nodes; if using embedded etcd, also open ports 2379 and 2380 between server nodes
  6. sudo access on both the existing cluster node and the new node

Sudo privileges

Before starting, switch to root on the new node so you do not hit permission errors during installation:

sudo su

Get the existing cluster configuration

These steps are done on the existing k3s server (master) node.

Confirm the cluster is running:

kubectl get node
NAME           STATUS   ROLES                  AGE   VERSION
k3s-server-1   Ready    control-plane,master   30m   v1.28.4+k3s2

Open the k3s service file to find the flags the cluster was started with:

nano /etc/systemd/system/k3s.service
EnvironmentFile=-/etc/systemd/system/k3s.service.env
KillMode=process
Delegate=yes
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
ExecStartPre=/bin/sh -xc '! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service'
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s \
    server \
        '--cluster-init' \
        '--tls-san=k3s_endpoint' \
        '--datastore-endpoint=mysql://k3s_user:K3S123@tcp(mysql_endpoint:3306)/k3s_database' \
        '--flannel-backend=none' \
        '--cluster-cidr=192.168.0.0/16' \
        '--disable-network-policy' \
        '--disable=traefik' \

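The quoted flags after server in the ExecStart line are what you need for the server join command. Every server node that joins must pass the same flags, with one exception: --cluster-init only initializes the first server, and the join command later in this tutorial leaves it out. Copy them now.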
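
If you would rather list the flags without opening an editor, a small sed filter is one option. This is a sketch that assumes the unit file wraps each flag in single quotes, as in the listing above; k3s_flags is a helper name made up for this example.

```shell
# Print each single-quoted flag from a k3s unit file, one per line,
# without the quotes or the trailing backslash.
k3s_flags() {
  sed -n "s/^[[:space:]]*'\(--[^']*\)'.*/\1/p" "$1"
}

# Usage on the existing server:
#   k3s_flags /etc/systemd/system/k3s.service
```

Lines that are not quoted flags, such as the server subcommand, are skipped.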

Note: k3s_endpoint and mysql_endpoint are domain names in this cluster. Run a ping to find the IP behind each name before moving to the new node:

root@k3s-server-1:/var/lib/rancher/k3s# ping k3s_endpoint
PING k3s_endpoint (127.0.0.1) 56(84) bytes of data.

root@k3s-server-1:/var/lib/rancher/k3s# ping mysql_endpoint
PING mysql_endpoint (127.0.0.1) 56(84) bytes of data.

In this case both names point to localhost on the master, which means the real IP is the master node’s network interface address, not 127.0.0.1. You will need that address when you configure /etc/hosts on the new node.
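
One way to look up that address on the master is to filter the loopback range out of the addresses the host reports. A minimal sketch, assuming hostname -I is available (it is on Ubuntu); first_non_loopback is a hypothetical helper name. If the master has several interfaces, check that the printed address is the one the new node can actually reach.

```shell
# Print the first IPv4 address from a whitespace-separated list on stdin
# that is not in the loopback range 127.0.0.0/8.
first_non_loopback() {
  tr ' ' '\n' \
    | grep -E '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$' \
    | grep -v '^127\.' \
    | head -n 1
}

# Usage on the master:
#   hostname -I | first_non_loopback
```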

Now get the cluster token:

cat /var/lib/rancher/k3s/server/token
K103ffe438192e539d52726e65e55d2a4eaa92023428ae522a2537709251a218339::server:K3SC11T

The part after the last colon is the token. In this example: K3SC11T.
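
If you script the join, shell parameter expansion can cut the short token out of the full string. A minimal sketch; the variable names are illustrative and the sample value is the example token shown above.

```shell
# Example token string from above. On a real server you would read it with:
#   full_token=$(cat /var/lib/rancher/k3s/server/token)
full_token='K103ffe438192e539d52726e65e55d2a4eaa92023428ae522a2537709251a218339::server:K3SC11T'

# Strip everything up to and including the last colon.
short_token=${full_token##*:}

echo "$short_token"
```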


Join a server node

These steps are done on the new k3s server node.

If your k3s_endpoint and mysql_endpoint are domain names, add them to /etc/hosts on the new node. Open the file:

nano /etc/hosts

Add the IP addresses you found in the previous section:

10.57.149.209 k3s_endpoint
10.57.149.209 mysql_endpoint

Use the master node’s network interface address, not the 127.0.0.1 you saw from the ping on the master itself. The 10.57.149.209 here is my master’s actual IP.

Note: If your k3s_endpoint and mysql_endpoint are already IP addresses, skip the /etc/hosts step.
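
If you prepare several nodes, appending the entries from a script avoids duplicate lines when the script runs twice. A sketch under the assumption that a simple name match is good enough for your hosts file; add_host_entry is a helper name invented for this example.

```shell
# Append "IP NAME" to a hosts file unless an entry for NAME already exists.
# The file path is a parameter so the function can be tried on a copy first.
add_host_entry() {
  hosts_file=$1
  ip=$2
  name=$3
  if grep -qE "[[:space:]]${name}([[:space:]]|\$)" "$hosts_file"; then
    echo "entry for ${name} already present in ${hosts_file}"
  else
    printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
  fi
}

# Usage on the new node, with the addresses from this tutorial:
#   add_host_entry /etc/hosts 10.57.149.209 k3s_endpoint
#   add_host_entry /etc/hosts 10.57.149.209 mysql_endpoint
```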

Confirm the names resolve correctly from the new node:

root@k3s-server-2:/home/bagus# ping mysql_endpoint
PING mysql_endpoint (10.57.149.209) 56(84) bytes of data.

root@k3s-server-2:/home/bagus# ping k3s_endpoint
PING k3s_endpoint (10.57.149.209) 56(84) bytes of data.

Verify the new node can reach the datastore:

nc -vz mysql_endpoint 3306

If this fails, check two things: a firewall rule may be blocking port 3306, or the database is bound to 127.0.0.1 only and not accepting remote connections.
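
The API and datastore checks can be combined into one preflight step before you run the installer. This is a sketch that assumes nc is installed on the new node; check_port and preflight are names made up for this example.

```shell
# Return 0 if a TCP connection to HOST PORT succeeds within 3 seconds.
check_port() {
  nc -z -w 3 "$1" "$2" >/dev/null 2>&1
}

# Print one "ok" or "FAIL" line per "host:port" argument. The return status
# is non-zero if any target was unreachable.
preflight() {
  failed=0
  for target in "$@"; do
    host=${target%:*}
    port=${target##*:}
    if check_port "$host" "$port"; then
      echo "ok   $target"
    else
      echo "FAIL $target"
      failed=1
    fi
  done
  return "$failed"
}

# Usage on the new server node:
#   preflight k3s_endpoint:6443 mysql_endpoint:3306
```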

Now run the join command. Use the token and every configuration flag you copied from the existing server:

curl -sfL https://get.k3s.io | sh -s - server \
  --token=K3SC11T \
  --tls-san=k3s_endpoint \
  --datastore-endpoint="mysql://k3s_user:K3S123@tcp(mysql_endpoint:3306)/k3s_database" \
  --flannel-backend=none \
  --cluster-cidr=192.168.0.0/16 \
  --disable-network-policy \
  --disable=traefik
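
As an alternative to a long command line, k3s can read the same options from /etc/rancher/k3s/config.yaml, where each key is the flag name without the leading dashes. The sketch below writes the file to a scratch location first so you can review it; CONFIG_PATH is a variable introduced for this example, and the values are the ones from this tutorial.

```shell
# Write the join flags as a k3s config file. CONFIG_PATH defaults to a
# scratch file; review it, then copy it to /etc/rancher/k3s/config.yaml
# on the new server.
CONFIG_PATH=${CONFIG_PATH:-${TMPDIR:-/tmp}/k3s-config.yaml}

cat > "$CONFIG_PATH" <<'EOF'
token: K3SC11T
tls-san:
  - k3s_endpoint
datastore-endpoint: "mysql://k3s_user:K3S123@tcp(mysql_endpoint:3306)/k3s_database"
flannel-backend: "none"
cluster-cidr: 192.168.0.0/16
disable-network-policy: true
disable:
  - traefik
EOF

# The file holds the token and the database password, so keep it private.
chmod 600 "$CONFIG_PATH"
```

With the file in place, the install command should shorten to curl -sfL https://get.k3s.io | sh -s - server, since k3s picks the options up from the config file at startup.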

Watch the control plane pods come up. This takes a few minutes the first time:

kubectl get pod --all-namespaces --watch

Once pods are running, confirm both nodes appear:

root@k3s-server-2:/home/bagus# kubectl get node
NAME           STATUS   ROLES                  AGE     VERSION
k3s-server-1   Ready    control-plane,master   10h     v1.28.4+k3s2
k3s-server-2   Ready    control-plane,master   4m19s   v1.28.4+k3s2

Both nodes show control-plane,master under ROLES. The control plane is now distributed across two machines.
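
If you want a script to confirm the result instead of reading the table, you can filter on the STATUS column. A minimal sketch; not_ready_nodes is a hypothetical helper, and it treats anything other than a plain Ready status (including Ready,SchedulingDisabled) as not ready.

```shell
# Read `kubectl get node --no-headers` output on stdin and print the names
# of nodes whose STATUS column is not exactly "Ready".
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Usage on a server node:
#   kubectl get node --no-headers | not_ready_nodes
```

An empty result means every node reports Ready. kubectl wait --for=condition=Ready node --all --timeout=300s is another option if you prefer to block until the nodes are up.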


Join an agent node

These steps are done on the new k3s agent node.

An agent node only needs the cluster token and the k3s API endpoint. It does not need the datastore connection string.

If k3s_endpoint is a domain name, add it to /etc/hosts on the new agent node:

nano /etc/hosts
10.57.149.209 k3s_endpoint

Note: If k3s_endpoint is already an IP address, skip this step.

Run the agent join command:

curl -sfL https://get.k3s.io | K3S_URL=https://k3s_endpoint:6443 K3S_TOKEN=K3SC11T sh -

After a moment, confirm the new node appears in the cluster. Run this on one of the existing server nodes:

root@k3s-server-1:/home/bagus# kubectl get node
NAME           STATUS   ROLES                  AGE     VERSION
k3s-server-1   Ready    control-plane,master   11h     v1.28.4+k3s2
k3s-server-2   Ready    <none>                 2m31s   v1.28.4+k3s2

The agent node shows <none> under ROLES, which is expected. Only server nodes carry the control-plane,master label. In this output the agent is the machine listed as k3s-server-2; the ROLES column, not the hostname, tells you how a node joined. The new node is ready to receive pod workloads. You can apply network policies to control traffic between pods across all nodes, and expose workloads on the new node using Ingress or the Gateway API.


Common errors and fixes

x509: certificate signed by unknown authority when joining

The new node cannot verify the API server’s TLS certificate. Start with the address the new node connects to: check --tls-san in /etc/systemd/system/k3s.service on the master, which must include every name and IP address that clients use to reach the cluster endpoint. If the SAN list is missing your new node’s connection address, you need to add it and rotate the server certificates. Note that a missing SAN more typically surfaces as “x509: certificate is valid for ..., not ...”; the “unknown authority” wording means the certificate chains to a CA the node does not trust, so also confirm that the token and endpoint belong to the same cluster and that the new node carries no leftover k3s state from an earlier install.

Token rejected / “failed to get CA certs”

The token you passed in --token or K3S_TOKEN does not match what is in /var/lib/rancher/k3s/server/token on the existing server. Re-read the token on the master and re-run the join command. The full token string (everything including the K103...::server: prefix) is valid and accepted, but copy it exactly, including case.

Agent joins but stays NotReady

The node appears in kubectl get node but never transitions to Ready. Usually the CNI plugin (Calico in this cluster) has not come up on the new node yet, or the pod running the CNI agent there is stuck. Check with:

kubectl get pod --all-namespaces -o wide | grep <new-node-name>

Look for CNI-related pods in CrashLoopBackOff or Pending. A missing network plugin means the kubelet cannot mark the node ready. Describe the stuck pod to get the actual error message.
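
To narrow the listing down to the pods that are actually stuck, you can filter on the STATUS column as well as the node name. A sketch only; stuck_pods_on_node is an invented helper. It scans for the node name instead of using a fixed column, because the RESTARTS column can contain spaces such as "5 (30s ago)".

```shell
# Read `kubectl get pod --all-namespaces -o wide --no-headers` output on
# stdin and print "namespace/name status" for pods on NODE that are neither
# Running nor Completed.
stuck_pods_on_node() {
  awk -v node="$1" '
    $4 != "Running" && $4 != "Completed" {
      for (i = 5; i <= NF; i++) {
        if ($i == node) { print $1 "/" $2, $4; break }
      }
    }'
}

# Usage on a server node:
#   kubectl get pod --all-namespaces -o wide --no-headers | stuck_pods_on_node <new-node-name>
```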

nc -vz to the datastore fails

You cannot reach mysql_endpoint:3306 from the new server node. Two causes: a firewall is blocking port 3306 between the new node and the database host, or the database is bound to 127.0.0.1 only. On the database server, run SHOW VARIABLES LIKE 'bind_address'; in MySQL. If it returns 127.0.0.1, update the bind address to 0.0.0.0 and restart MySQL, then restrict access with a firewall or MySQL user grants rather than relying on bind address for security.

Datastore reachable but k3s server fails to start

k3s connects to the datastore but then exits. Check the journal for the actual error:

journalctl -u k3s -f

The most common causes are a port conflict (another process is already using port 6443) or a --tls-san hostname that resolves to a loopback address on the new node, so k3s cannot bind. Use ss -tlnp | grep 6443 to check the port and ping k3s_endpoint on the new node to confirm the resolution is what you expect.

Node joins but kubectl get node does not show it on the new node

The node appeared in kubectl get node on the master but running the same command on the new server node returns an error or an empty list. On a k3s server node, the kubeconfig is written to /etc/rancher/k3s/k3s.yaml. If your shell session predates the join, the KUBECONFIG variable may not be set:

export KUBECONFIG=/etc/rancher/k3s/k3s.yaml

Agent nodes do not have kubectl available by default. To use kubectl on an agent, copy the kubeconfig from any server node and set KUBECONFIG to point at it.
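
A copied k3s.yaml points at https://127.0.0.1:6443, which only works on the server it came from, so the server address needs rewriting before you use it elsewhere. A minimal sketch; rewrite_kubeconfig is a helper name made up here, and k3s_endpoint is the endpoint name used throughout this tutorial.

```shell
# Copy a k3s kubeconfig, replacing the loopback API address with ENDPOINT.
rewrite_kubeconfig() {
  src=$1
  dst=$2
  endpoint=$3
  sed "s#https://127\.0\.0\.1:6443#https://${endpoint}:6443#" "$src" > "$dst"
  chmod 600 "$dst"   # the kubeconfig carries cluster-admin credentials
}

# Usage, after copying /etc/rancher/k3s/k3s.yaml from a server node:
#   rewrite_kubeconfig ./k3s.yaml "$HOME/.kube/config" k3s_endpoint
```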

Firewall blocking port 6443 or 2379/2380

Nodes can resolve DNS and reach each other by ping, but the join hangs or times out. Run nc -vz k3s_endpoint 6443 from the new node. If it hangs, a firewall is blocking the k3s API port. On Ubuntu with UFW:

ufw allow 6443/tcp

If you are using embedded etcd rather than an external datastore, also open the etcd client port (2379) and peer port (2380) on every server node:

ufw allow 2379/tcp
ufw allow 2380/tcp

FAQ

Can I join a node without an external datastore?

Yes. If your first server node was started with --cluster-init and no --datastore-endpoint, k3s uses embedded etcd for cluster state. Additional server nodes in this case do not pass --cluster-init again, because that flag only initializes the first server. They join by passing --server https://<first-server>:6443 together with your token and other flags. Agent nodes do not interact with the datastore at all; they only need K3S_URL and K3S_TOKEN.

How do I remove a node from a k3s cluster?

Drain it first so its workloads reschedule elsewhere: kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data. Then delete the record: kubectl delete node <node-name>. Finally, on the node itself run /usr/local/bin/k3s-uninstall.sh for server nodes or /usr/local/bin/k3s-agent-uninstall.sh for agent nodes to remove the k3s installation and clean up.
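
The two kubectl steps can be wrapped in a helper that prints the commands before running them. This is a sketch, not a tested runbook: remove_node and DRY_RUN are names invented for this example, and the uninstall script still has to be run on the node itself afterwards.

```shell
# Print the drain and delete commands for NODE (the default), or run them
# when DRY_RUN=0 is set in the environment.
remove_node() {
  node=$1
  dry_run=${DRY_RUN:-1}
  for cmd in \
    "kubectl drain $node --ignore-daemonsets --delete-emptydir-data" \
    "kubectl delete node $node"
  do
    if [ "$dry_run" = "1" ]; then
      echo "$cmd"
    else
      $cmd || return 1   # stop if the drain fails
    fi
  done
}

# Usage on a server node:
#   remove_node <node-name>             # prints the commands
#   DRY_RUN=0 remove_node <node-name>   # runs them
```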

What is the difference between K3S_URL and the --server flag?

They are equivalent. K3S_URL is an environment variable read by the install script, which is the form used in the agent join command above. --server is the same value passed as an explicit flag to k3s agent or k3s server. Both accept https://<endpoint>:6443 as the value. Use whichever form fits how you are running the install command.

Does the k3s join token expire?

No. The token in /var/lib/rancher/k3s/server/token does not have an expiry. It is a static secret that stays valid until you manually rotate it. Treat it like a password: if it is ever exposed, replace the token file on the server, restart k3s on all server nodes, and rejoin any nodes that used the old token.

Can I join a node to a k3s cluster running on a different OS version?

Generally yes. k3s does not depend on a specific Ubuntu version beyond the kernel requirements for cgroups and iptables support. Joining an Ubuntu 24.04 node to a cluster originally built on Ubuntu 20.04 works in practice. Check the k3s documentation for the minimum supported kernel version if you are using an older distribution.

How many server nodes do I need for high availability?

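Three, if you use embedded etcd. With two etcd members, a single failure costs the cluster its majority and the control plane stops accepting writes, so two nodes tolerate no more failures than one. Three nodes means any single node can fail while the remaining two maintain quorum and the API server keeps running. Going to five nodes tolerates two simultaneous failures, but for most clusters three is sufficient. With an external datastore, as in this tutorial, the servers hold no quorum of their own, so two server nodes are enough to survive the loss of one as long as the database stays up.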
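
The arithmetic behind those numbers is the usual majority rule for etcd: quorum is floor(N/2) + 1 members, and the cluster tolerates N minus quorum failures. A small sketch to make it concrete; quorum_info is a name made up for this example.

```shell
# Majority quorum for an etcd cluster of N members and the number of member
# failures it tolerates: quorum = floor(N/2) + 1, tolerance = N - quorum.
quorum_info() {
  n=$1
  quorum=$(( n / 2 + 1 ))
  echo "servers=$n quorum=$quorum tolerated_failures=$(( n - quorum ))"
}

for n in 1 2 3 5; do
  quorum_info "$n"
done
# servers=1 quorum=1 tolerated_failures=0
# servers=2 quorum=2 tolerated_failures=0
# servers=3 quorum=2 tolerated_failures=1
# servers=5 quorum=3 tolerated_failures=2
```

Note that two servers tolerate no more failures than one, which is why etcd clusters are sized in odd numbers.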