Part 1: kubernetes-internals

How Kubernetes Decides Where to Run a Pod

When you run kubectl apply -f pod.yaml, the Pod appears on some node in the cluster. It feels like magic. It isn't. There's a deterministic process behind it, and understanding it lets you predict — and control — where your workloads land.

The kube-scheduler

The kube-scheduler is a control plane component that runs as its own process. Its only responsibility is assigning Pods to nodes. It doesn't run them and doesn't monitor them: it only decides spec.nodeName.

The main loop is conceptually simple:

1. Watch the queue of Pods without an assigned node
2. For each Pod, filter out nodes that CANNOT run it
3. Among remaining nodes, score and choose the best
4. Write spec.nodeName into the Pod

Each step has real complexity. Let's go through them.
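The loop above can be reduced to a few lines. This is a toy sketch in Python, not the real scheduler (which is written in Go and far more involved); `filter_fns` and `score_fns` are illustrative stand-ins for the plugin chains:

```python
# Toy model of the scheduler's main loop. filter_fns and score_fns
# stand in for the real filter/score plugins; this is an illustration,
# not the actual kube-scheduler code.

def schedule(pod, nodes, filter_fns, score_fns):
    # Phase 1: filtering - drop nodes that cannot run the Pod.
    feasible = [n for n in nodes if all(f(pod, n) for f in filter_fns)]
    if not feasible:
        return None  # Pod stays Pending (preemption may kick in)

    # Phase 2: scoring - rank the survivors and pick the best.
    best = max(feasible, key=lambda n: sum(s(pod, n) for s in score_fns))

    # Phase 3: binding - write the decision into the Pod.
    pod["spec"]["nodeName"] = best["name"]
    return best["name"]
```

With a single filter ("enough free CPU?") and a single score ("how much free CPU?") this already reproduces the Pending behavior: if every node fails the filter, no binding happens.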

Phase 1: Filtering (predicates)

The scheduler applies a series of filter plugins that eliminate incompatible nodes. The most important:

NodeResourcesFit

Checks that the node has enough resources for the Pod's requests:

```yaml
resources:
  requests:
    cpu: "500m"
    memory: "256Mi"
```

If the node has 300m CPU available, it's eliminated. requests are the scheduling unit — limits don't influence scheduling.
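The check itself is simple arithmetic over allocatable capacity minus what's already requested. A simplified sketch (quantities pre-parsed to plain numbers like millicores and bytes; the real plugin works with resource.Quantity values):

```python
# Simplified NodeResourcesFit: a node passes only if, for every
# resource, allocatable minus already-requested covers the Pod's
# request. Quantities are plain numbers here (millicores, bytes).

def fits_resources(pod_requests, node_allocatable, node_requested):
    for resource, amount in pod_requests.items():
        free = node_allocatable.get(resource, 0) - node_requested.get(resource, 0)
        if amount > free:
            return False
    return True
```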

NodeSelector and NodeAffinity

```yaml
nodeSelector:
  kubernetes.io/arch: amd64
```

Eliminates nodes that don't have that label. nodeAffinity offers the same functionality with a more expressive syntax: requiredDuringSchedulingIgnoredDuringExecution is a hard requirement (equivalent to nodeSelector), while preferredDuringSchedulingIgnoredDuringExecution is a soft preference that influences scoring.
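The hard-requirement semantics of nodeSelector fit in one line: every requested label must exist on the node with exactly the requested value. A minimal sketch:

```python
# nodeSelector semantics: every key/value in the selector must be
# present on the node's labels with exactly that value.

def matches_node_selector(node_labels, node_selector):
    return all(node_labels.get(k) == v for k, v in node_selector.items())
```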

TaintToleration

A node with a taint rejects Pods that don't have the matching toleration. Useful for GPU nodes, system nodes, or any node that should be reserved:

```bash
kubectl taint nodes gpu-node-1 dedicated=gpu:NoSchedule
```

Only Pods with this toleration can be scheduled there:

```yaml
tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
```
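The matching rule can be sketched like this (simplified: only the Equal operator and exact effect matching are modeled; the real plugin also handles Exists, empty keys, and empty effects):

```python
# Simplified TaintToleration: a node is feasible only if every taint
# on it is tolerated by at least one of the Pod's tolerations.
# Only operator: Equal with exact effect matching is modeled here.

def tolerates(toleration, taint):
    return (toleration["key"] == taint["key"]
            and toleration.get("value") == taint.get("value")
            and toleration["effect"] == taint["effect"])

def passes_taints(pod_tolerations, node_taints):
    return all(any(tolerates(t, taint) for t in pod_tolerations)
               for taint in node_taints)
```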

PodTopologySpread

Ensures distribution across availability zones or nodes. Prevents all your Pods from ending up on the same physical node:

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-service
```

maxSkew: 1 means the difference between the node with the most Pods and the node with the fewest cannot exceed 1.
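The constraint can be checked with a simple count per topology domain. A sketch (a simplification of the real skew computation, which only considers domains that have eligible nodes):

```python
# Simplified maxSkew check: count matching Pods per topology domain,
# simulate placing the new Pod on a candidate domain, and verify the
# max-min difference stays within maxSkew.

def placement_allowed(counts, domain, max_skew):
    counts = dict(counts)  # don't mutate the caller's counts
    counts[domain] = counts.get(domain, 0) + 1
    return max(counts.values()) - min(counts.values()) <= max_skew
```

With counts {node-a: 2, node-b: 1} and maxSkew: 1, placing on node-a would produce a skew of 2 and is rejected; node-b evens things out and is allowed.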

Phase 2: Scoring (priorities)

Each node that survived filtering receives a score from 0 to 100 from every scoring plugin; the weighted sum of those scores decides the ranking. The most relevant scoring plugins:

| Plugin | Favors |
|---|---|
| LeastAllocated | Nodes with more free resources (spreading) |
| MostAllocated | Fuller nodes (bin-packing) |
| NodeAffinity | Nodes matching soft preferences |
| InterPodAffinity | Co-location with other Pods |
| ImageLocality | Nodes that already have the image downloaded |

By default, the scheduler scores resources with the LeastAllocated strategy. The result is that Pods spread across nodes.
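The two resource strategies are mirror images of each other, which a back-of-the-envelope formula makes obvious (single-resource sketch on the per-plugin 0-100 scale; the real plugins average over multiple resources):

```python
# LeastAllocated rewards free capacity; MostAllocated rewards usage.
# Single-resource sketch on the 0-100 per-plugin scale.

def least_allocated_score(requested, allocatable):
    return round((allocatable - requested) * 100 / allocatable)

def most_allocated_score(requested, allocatable):
    return round(requested * 100 / allocatable)
```

A node with 500m requested out of 2000m allocatable scores 75 under LeastAllocated and 25 under MostAllocated: the same facts, opposite preferences.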

Phase 3: Selection and binding

The scheduler picks the highest-scoring node. On ties, it picks randomly among the tied nodes.

Then it performs the binding: a write to the Kubernetes API that updates the Pod's spec.nodeName. The kubelet on the destination node detects the change and starts the containers.

Preemption: when there's no room

If no node passes filtering, the Pod stays Pending. The scheduler can then attempt preemption: find lower-priority Pods on some node, evict them, and free space for the pending Pod.

This requires the Pod to have a defined PriorityClass:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
```

```yaml
spec:
  priorityClassName: high-priority
```

A Pod with value: 1000000 can evict Pods with lower values. Pods without priorityClassName have priority 0.

Practical diagnosis

If a Pod is Pending, the first place to look:

```bash
kubectl describe pod <name>
```

The Events section says exactly which plugin failed and why:

```
Warning  FailedScheduling  0/3 nodes are available:
  1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: },
  2 node(s) didn't match Pod's node affinity/selector.
```

There's also the scheduler extender and plugin framework for advanced customization, but that's material for another article.

Summary

The kube-scheduler makes a two-phase decision: first it eliminates nodes that cannot run the Pod (filtering), then it picks the best among the remaining ones (scoring). Preemption exists as a last resort for high-priority Pods.

Understanding these phases lets you debug Pods stuck in Pending, design node topologies with intent, and write manifests that express exactly where your workload should run.
