
Two things trip up most engineers running Kubernetes in production: node pools and networking. You can get a cluster running without fully understanding either, but when something breaks or you need to scale, you will wish you had.
This guide covers both — what node pools are, how to set them up on AWS, GCP, Azure, and DigitalOcean, and how nodes communicate with each other inside a cluster.
Part 1 — Node Pools
What Is a Node Pool?
A node pool is a group of nodes in a Kubernetes cluster that all have the same configuration — same machine type, same CPU, same RAM, same operating system.
A cluster can have multiple node pools. Each pool can have a different configuration. This lets you run different types of workloads on different hardware.
Example:
- Pool 1 — 3 general-purpose nodes (4 CPU, 8GB RAM) — for web and API workloads
- Pool 2 — 2 memory-optimised nodes (4 CPU, 32GB RAM) — for databases
- Pool 3 — 5 spot/preemptible nodes — for batch jobs and CI/CD runners
Each pool can autoscale independently. You scale the batch pool during heavy CI runs without touching the production pool.
Why Use Multiple Node Pools?
1. Workload isolation: You do not want your batch jobs competing for resources with your production API. Put them on separate pools.
2. Cost optimisation: Run non-critical workloads on spot or preemptible instances (60-90% cheaper). Keep production on on-demand nodes.
3. Different hardware requirements: Some workloads need more memory, others need GPU, others just need cheap CPU. Match the hardware to the workload.
4. Independent autoscaling: Each pool scales independently based on actual demand. Your GPU pool does not scale up just because your API has a spike.
Node Pool Commands — Common Operations
These commands work across cloud providers once you have kubectl configured.
List all nodes and their pools:
kubectl get nodes --show-labels
Check node pool labels:
kubectl get nodes -L cloud.google.com/gke-nodepool
# or for EKS
kubectl get nodes -L eks.amazonaws.com/nodegroup
Cordon a node (prevent new pods from scheduling):
kubectl cordon <node-name>
Drain a node (move pods off before maintenance):
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
Check node resource usage:
kubectl top nodes
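Cordon, drain, and uncordon are typically chained together during node maintenance. A sketch of the flow (the node name is illustrative):

```shell
# Take a node out of service, patch it, then return it to the pool
kubectl cordon worker-node-1          # stop new pods from scheduling here
kubectl drain worker-node-1 --ignore-daemonsets --delete-emptydir-data
# ...perform OS patching or other maintenance...
kubectl uncordon worker-node-1        # allow scheduling again
```

Note that `kubectl top nodes` requires the metrics-server addon to be installed in the cluster.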
Node Pools on AWS EKS
AWS calls node pools “managed node groups.” You can create them using eksctl or Terraform.
Create a node group with eksctl:
# nodegroup.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: my-cluster
  region: us-east-1

managedNodeGroups:
  - name: general-pool
    instanceType: t3.medium
    minSize: 2
    maxSize: 10
    desiredCapacity: 3
    labels:
      workload: general
    tags:
      environment: production

  - name: spot-pool
    instanceTypes:
      - t3.medium
      - t3a.medium
      - t2.medium
    capacityType: SPOT
    minSize: 1
    maxSize: 20
    desiredCapacity: 3
    labels:
      workload: batch
      capacity-type: spot
    taints:
      - key: spot
        value: "true"
        effect: NoSchedule
Apply it:
eksctl create nodegroup -f nodegroup.yaml
Key points for EKS:
- Control plane costs $0.10/hour per cluster (~$73/month) — this is charged regardless of node count
- Spot instances on EKS can save 60-90% — use multiple instance types for better availability
- Managed node groups handle OS patching and node replacement automatically
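Once the config is applied, eksctl can list and scale the groups directly. These commands assume the cluster and group names from the example above:

```shell
# List node groups in the cluster
eksctl get nodegroup --cluster my-cluster --region us-east-1

# Scale the spot pool manually (the autoscaler still adjusts within min/max)
eksctl scale nodegroup --cluster my-cluster --name spot-pool --nodes 5 --region us-east-1
```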
Node Pools on GCP GKE
GKE has the most mature node pool support. You can create pools via gcloud or through the Console.
Create a node pool with gcloud:
# General purpose pool
gcloud container node-pools create general-pool \
  --cluster my-cluster \
  --region us-central1 \
  --machine-type n2-standard-4 \
  --num-nodes 3 \
  --enable-autoscaling \
  --min-nodes 2 \
  --max-nodes 10 \
  --node-labels workload=general

# Spot instance pool for batch workloads
gcloud container node-pools create spot-pool \
  --cluster my-cluster \
  --region us-central1 \
  --machine-type n2-standard-4 \
  --spot \
  --num-nodes 3 \
  --enable-autoscaling \
  --min-nodes 0 \
  --max-nodes 20 \
  --node-labels workload=batch,capacity-type=spot \
  --node-taints spot=true:NoSchedule
Key points for GKE:
- The control plane costs $0.10/hour (~$73/month) per cluster, though the free tier waives this fee for one zonal or Autopilot cluster per billing account
- GKE Autopilot mode automatically manages node pools for you — you just define pods
- Preemptible VMs had a 24-hour maximum runtime; their replacement, Spot VMs (created with --spot above), have no maximum runtime but can be reclaimed at any time
- GKE has the best multi-cluster management of any provider via Fleet
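To inspect or manually resize the pools created above (names carried over from the example):

```shell
# List node pools and their settings
gcloud container node-pools list --cluster my-cluster --region us-central1

# Manually resize a pool (the autoscaler still enforces min/max)
gcloud container clusters resize my-cluster \
  --node-pool spot-pool \
  --num-nodes 5 \
  --region us-central1
```

On a regional cluster, --num-nodes is the count per zone, not the cluster total.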
Node Pools on Azure AKS
Azure calls them “node pools” or “agent pools.” The system node pool is required and runs system components. User node pools run your workloads.
Create a node pool with Azure CLI:
# General purpose user node pool
az aks nodepool add \
  --resource-group my-rg \
  --cluster-name my-cluster \
  --name generalpool \
  --node-count 3 \
  --node-vm-size Standard_D4s_v3 \
  --enable-cluster-autoscaler \
  --min-count 2 \
  --max-count 10 \
  --labels workload=general

# Spot instance pool
az aks nodepool add \
  --resource-group my-rg \
  --cluster-name my-cluster \
  --name spotpool \
  --node-count 3 \
  --node-vm-size Standard_D2s_v3 \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 20 \
  --labels workload=batch capacity-type=spot \
  --node-taints "spot=true:NoSchedule"
Using Terraform for AKS node pools:
resource "azurerm_kubernetes_cluster_node_pool" "spot" {
  name                  = "spotpool"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
  vm_size               = "Standard_D2s_v3"
  priority              = "Spot"
  eviction_policy       = "Delete"
  spot_max_price        = -1
  enable_auto_scaling   = true
  min_count             = 1
  max_count             = 20

  node_labels = {
    "workload"      = "batch"
    "capacity-type" = "spot"
  }

  node_taints = ["spot=true:NoSchedule"]
}
Key points for AKS:
- Control plane is free on the Free tier; the Standard tier, which adds an uptime SLA, costs $0.10/hour per cluster
- System node pool cannot be deleted — it runs kube-system components
- Spot VMs on AKS can be evicted with 30 seconds notice — design workloads to tolerate this
- Azure Spot Eviction Handler should be installed for graceful pod termination
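The matching inspection and scaling commands for the AKS pools above:

```shell
# List node pools with counts and VM sizes
az aks nodepool list --resource-group my-rg --cluster-name my-cluster -o table

# Manually scale a pool (disable the cluster autoscaler on the pool first if it is enabled)
az aks nodepool scale \
  --resource-group my-rg \
  --cluster-name my-cluster \
  --name generalpool \
  --node-count 5
```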
Node Pools on DigitalOcean DOKS
DigitalOcean Kubernetes is the simplest to set up. No control plane fee — you only pay for the Droplets (nodes).
Create a node pool with doctl:
# Create cluster with initial pool
doctl kubernetes cluster create my-cluster \
  --region nyc1 \
  --version 1.29.1-do.0 \
  --node-pool "name=general-pool;size=s-2vcpu-4gb;count=3;auto-scale=true;min-nodes=2;max-nodes=10"

# Add a second pool to an existing cluster
doctl kubernetes cluster node-pool create my-cluster \
  --name backend-pool \
  --size s-4vcpu-8gb \
  --count 2 \
  --auto-scale \
  --min-nodes 1 \
  --max-nodes 5 \
  --tag backend
Using Terraform for DOKS node pools:
resource "digitalocean_kubernetes_cluster" "main" {
  name    = "my-cluster"
  region  = "nyc1"
  version = "1.29.1-do.0"

  node_pool {
    name       = "general-pool"
    size       = "s-2vcpu-4gb"
    node_count = 3
    auto_scale = true
    min_nodes  = 2
    max_nodes  = 10

    labels = {
      workload = "general"
    }
  }
}

resource "digitalocean_kubernetes_node_pool" "backend" {
  cluster_id = digitalocean_kubernetes_cluster.main.id
  name       = "backend-pool"
  size       = "s-4vcpu-8gb"
  node_count = 2
  auto_scale = true
  min_nodes  = 1
  max_nodes  = 5

  labels = {
    workload = "backend"
  }
}
Key points for DOKS:
- No control plane fee — best value for small to medium clusters
- Node sizes range from $12/month (1 vCPU, 2GB) to GPU Droplets
- Load balancers cost $12/month each — use an Ingress controller to share one across services
- Bandwidth is mostly included — no surprise data transfer charges
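The matching doctl commands to inspect or resize the DOKS pools from the example:

```shell
# List node pools in the cluster
doctl kubernetes cluster node-pool list my-cluster

# Resize a pool manually
doctl kubernetes cluster node-pool update my-cluster backend-pool --count 4
```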
Scheduling Pods on Specific Node Pools
Once you have multiple pools, you need to tell Kubernetes which pods go where.
Using nodeSelector (simple):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-job
spec:
  selector:
    matchLabels:
      app: batch-job
  template:
    metadata:
      labels:
        app: batch-job
    spec:
      nodeSelector:
        workload: batch
      containers:
        - name: worker
          image: myapp:latest
Using nodeAffinity (more flexible):
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: workload
                operator: In
                values:
                  - batch
Using tolerations for spot nodes:
If a node pool has a taint (like spot=true:NoSchedule), pods that do not tolerate the taint will not be scheduled there. Add a toleration to allow scheduling:
spec:
  tolerations:
    - key: "spot"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: capacity-type
                operator: In
                values:
                  - spot
This toleration allows the pod to run on spot nodes. The affinity prefers spot nodes but does not require them — if no spot node is available, the pod runs on an on-demand node.
Node Pool Cost Comparison
| Provider | Control Plane | Node Cost (2 vCPU, 4GB) | Load Balancer | Spot Savings |
|---|---|---|---|---|
| AWS EKS | $73/month | ~$120/month (on-demand) | $40/month | 60-90% |
| GCP GKE | $73/month (first zonal cluster free) | ~$100/month (on-demand) | $45/month | 60-91% |
| Azure AKS | Free | ~$88/month (on-demand) | $35/month | 60-90% |
| DigitalOcean DOKS | Free | $24/month | $12/month | N/A |
For small teams and startups: DigitalOcean DOKS wins on simplicity and cost. For enterprise and complex workloads: GKE wins on features and multi-cluster management.
Part 2 — Node Networking
How Nodes Communicate
Kubernetes imposes two networking rules that must always be true:
- Every pod can communicate with every other pod on any node — without NAT
- Agents on a node (kubelet, kube-proxy) can communicate with all pods on that node
Kubernetes does not implement this itself. It delegates to a CNI (Container Network Interface) plugin.
What Is a CNI Plugin?
CNI is a standard that defines how network plugins connect containers to the network. When a pod is created, the kubelet calls the CNI plugin to:
- Create a network interface inside the pod’s network namespace
- Assign an IP address to the pod
- Set up routing so the pod can communicate with other pods and services
When the pod is deleted, the CNI plugin tears down the network interface and releases the IP.
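Concretely, the kubelet reads a CNI configuration file from /etc/cni/net.d/ on each node. A minimal sketch for the reference bridge plugin with host-local IP allocation (the network name and subnet are illustrative):

```json
{
  "cniVersion": "1.0.0",
  "name": "mynet",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.244.0.0/24",
    "routes": [{ "dst": "0.0.0.0/0" }]
  }
}
```

Full CNIs like Calico and Cilium install their own config file here and do far more, but the contract with the kubelet is the same.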
Each node gets a subnet (CIDR block) assigned to it. All pods on that node get IPs from that subnet.
Node 1 — 10.244.0.0/24 — Pods get IPs like 10.244.0.5, 10.244.0.6
Node 2 — 10.244.1.0/24 — Pods get IPs like 10.244.1.5, 10.244.1.6
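On CNIs that use the built-in per-node allocation, each node's pod CIDR is recorded on the node object and can be read directly:

```shell
# Show each node's assigned pod subnet
# (empty on CNIs that manage IPs themselves, e.g. the AWS VPC CNI)
kubectl get nodes -o custom-columns=NAME:.metadata.name,PODCIDR:.spec.podCIDR
```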
How Pod-to-Pod Communication Works Across Nodes
When a pod on Node 1 sends a packet to a pod on Node 2, one of two things happens depending on your CNI:
Option 1 — Overlay networking (VXLAN)
The packet is encapsulated inside another packet. The outer packet is addressed to Node 2’s IP. Node 2 receives it, unwraps it, and delivers it to the correct pod.
This is how Flannel and Calico in VXLAN mode work. It is simple and works everywhere, but adds encapsulation overhead.
Pod A (10.244.0.5) → [VXLAN encapsulation] → Node 2 (192.168.1.2) → Pod B (10.244.1.5)
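The encapsulation overhead is easy to quantify. VXLAN carries the original pod frame as UDP payload, so every packet pays for the inner Ethernet header plus the outer IP, UDP, and VXLAN headers, which is why overlay CNIs lower the pod interface MTU. A quick sketch of the arithmetic:

```shell
# Per-packet VXLAN overhead on a standard 1500-byte network:
#   inner Ethernet (14) + outer IPv4 (20) + UDP (8) + VXLAN (8) = 50 bytes
overhead=$((14 + 20 + 8 + 8))
node_mtu=1500
pod_mtu=$((node_mtu - overhead))
echo "$pod_mtu"   # 1450 — the default MTU Flannel sets on pod interfaces
```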
Option 2 — Direct routing (BGP)
No encapsulation. Routes for each node’s pod subnet are distributed to the network. Traffic flows directly between nodes like any other IP routing.
This is how Calico in BGP mode and Cilium in native routing mode work. Better performance, but requires the underlying network to support it.
Pod A (10.244.0.5) → [direct route] → Node 2 (192.168.1.2) → Pod B (10.244.1.5)
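With direct routing there is nothing exotic on the node: each peer node's pod subnet is just another entry in the ordinary kernel routing table. Illustrative output on Node 1, using the addresses from the diagram above:

```shell
# Routes installed by a BGP-mode CNI on Node 1
ip route show | grep 10.244
# 10.244.1.0/24 via 192.168.1.2 dev eth0    <- Node 2's pod subnet
```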
CNI Plugins — Which One to Use
| CNI Plugin | Mode | Best For | Default On |
|---|---|---|---|
| Flannel | VXLAN overlay | Simple clusters, easy setup | Some bare-metal distros |
| Calico | VXLAN or BGP | Network policies, enterprise | On-prem and many bare-metal distros |
| Cilium | eBPF-based | Performance, security, observability | GKE (Dataplane V2), AKS optional |
| AWS VPC CNI | Native VPC routing | AWS EKS | EKS (default) |
| Azure CNI | Azure VNet overlay | Azure AKS | AKS (default) |
| Flannel/Canal | VXLAN | DOKS, simple setups | DOKS |
In 2026, Cilium is the fastest-growing CNI. It uses eBPF (extended Berkeley Packet Filter) — programs that run in the Linux kernel — instead of iptables. This gives it better performance, lower latency, and richer observability. Cilium v1.19 is the current stable release.
Cloud Provider CNI Defaults
AWS EKS — VPC CNI
EKS uses the AWS VPC CNI by default. Pods get real VPC IP addresses — not overlay IPs. This means pods are directly accessible within your VPC without any NAT.
The trade-off: you need enough IP addresses in your VPC subnet. Large clusters can exhaust IP space quickly.
# Check VPC CNI version
kubectl describe daemonset aws-node -n kube-system | grep Image
GCP GKE — Cilium (Dataplane V2)
GKE uses Cilium as its default CNI in Dataplane V2 mode. eBPF-based, good performance, network policy support built in.
Azure AKS — Azure CNI Overlay
AKS uses Azure CNI Overlay by default. Pods get IPs from a private overlay CIDR, not directly from the VNet. This reduces IP exhaustion issues compared to the older Azure CNI flat model.
DigitalOcean DOKS — Flannel
DOKS uses Flannel with VXLAN. Simple and reliable, but no built-in network policy support. Install Calico or Cilium on top for network policies.
Network Policies — Controlling Traffic Between Pods
By default, all pods in a Kubernetes cluster can communicate with all other pods. Network Policies restrict this.
Deny all ingress to a namespace by default:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
Allow only specific pods to connect:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-db
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: database
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-server
      ports:
        - protocol: TCP
          port: 5432
This allows only pods with label app: api-server to connect to pods with label app: database on port 5432. Everything else is blocked.
Note: Network Policies require a CNI that supports them — Calico, Cilium, or Weave Net. Flannel alone does not enforce network policies.
Troubleshooting Node Connectivity
Check if two pods can communicate:
# Get pod IPs
kubectl get pods -o wide
# Exec into a pod and test connectivity
kubectl exec -it pod-a -- curl http://10.244.1.5:8080
# Test with ping
kubectl exec -it pod-a -- ping 10.244.1.5
Check CNI plugin is running on all nodes:
kubectl get pods -n kube-system | grep -E "flannel|calico|cilium|aws-node"
Check node network status:
kubectl describe node <node-name> | grep -A 5 "Conditions:"
Debug with network tools:
# Run a debug pod with network tools
kubectl run netdebug --image=nicolaka/netshoot --rm -it -- bash
# Inside the pod
ping 10.244.1.5
traceroute 10.244.1.5
nslookup kubernetes.default
curl http://my-service.my-namespace.svc.cluster.local
Check kube-proxy is running:
kubectl get pods -n kube-system | grep kube-proxy
kubectl logs -n kube-system kube-proxy-<pod-name>
Summary
Node pools:
- Group nodes by workload type and hardware requirements
- Use spot/preemptible pools for non-critical workloads (60-90% cheaper)
- Use taints and tolerations to control which pods go to which pools
- Each cloud provider has a slightly different setup but the concepts are the same
Node networking:
- Kubernetes requires every pod to reach every other pod without NAT
- CNI plugins implement this — Flannel for simplicity, Calico for network policies, Cilium for performance
- Each cloud provider has a default CNI — VPC CNI on EKS, Cilium on GKE, Azure CNI on AKS
- Network Policies restrict pod-to-pod traffic — use them in production