When you run kubectl apply -f pod.yaml, the Pod appears on some node in the cluster. It feels like magic. It isn't. There's a deterministic process behind it, and understanding it lets you predict — and control — where your workloads land.
The kube-scheduler
The kube-scheduler runs as its own process, separate from the other control-plane components. Its only responsibility is assigning Pods to nodes. It doesn't run them and doesn't monitor them: it only decides spec.nodeName.
The main loop is conceptually simple:
1. Watch the queue of Pods without an assigned node
2. For each Pod, filter out nodes that CANNOT run it
3. Among remaining nodes, score and choose the best
4. Write spec.nodeName into the Pod
Each step has real complexity. Let's go through them.
Phase 1: Filtering (predicates)
The scheduler applies a series of filter plugins that eliminate incompatible nodes. The most important:
NodeResourcesFit
Checks that the node has enough resources for the Pod's requests:
```yaml
resources:
  requests:
    cpu: "500m"
    memory: "256Mi"
```
If the node has 300m CPU available, it's eliminated. requests are the scheduling unit — limits don't influence scheduling.
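To see what a node can actually fit, compare its allocatable capacity with the requests already placed on it (substitute one of your node names):

```shell
# Shows Allocatable plus the "Allocated resources" summary of existing requests
kubectl describe node <node-name> | grep -A 8 "Allocated resources"
```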
NodeSelector and NodeAffinity
```yaml
nodeSelector:
  kubernetes.io/arch: amd64
```
Eliminates nodes that don't carry that label. nodeAffinity offers the same functionality with a more expressive syntax: requiredDuringSchedulingIgnoredDuringExecution is a hard requirement (equivalent to nodeSelector), while preferredDuringSchedulingIgnoredDuringExecution is a soft preference that influences scoring instead of filtering.
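For reference, the nodeSelector above expressed as a hard nodeAffinity rule in the Pod spec:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
```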
TaintToleration
A node with a taint rejects Pods that don't have the matching toleration. Useful for GPU nodes, system nodes, or any node that should be reserved:
```shell
kubectl taint nodes gpu-node-1 dedicated=gpu:NoSchedule
```
Only Pods with this toleration can be scheduled there:
```yaml
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"
```
PodTopologySpread
Ensures distribution across availability zones or nodes. Prevents all your Pods from ending up on the same physical node:
```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: my-service
```
maxSkew: 1 means the difference between the node with the most Pods and the node with the fewest cannot exceed 1.
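The same pattern spreads Pods across availability zones by swapping the topologyKey; here sketched as a soft constraint (ScheduleAnyway), so an imbalance only degrades the score instead of blocking scheduling:

```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone  # well-known zone label
  whenUnsatisfiable: ScheduleAnyway         # soft: prefer balance, don't block
  labelSelector:
    matchLabels:
      app: my-service
```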
Phase 2: Scoring (priorities)
Of the nodes that survived filtering, the scheduler assigns a score of 0 to 100 to each. The most relevant scoring plugins:
| Plugin | Favors |
|---|---|
| LeastAllocated | Nodes with more free resources (spreading) |
| MostAllocated | Fuller nodes (bin-packing) |
| NodeAffinity | Nodes matching soft preferences |
| InterPodAffinity | Co-location with other Pods |
| ImageLocality | Nodes that already have the image downloaded |
By default, LeastAllocated carries the most weight. The result is that Pods spread across nodes.
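If you prefer bin-packing, the scoring strategy can be switched in the scheduler's configuration file (a sketch, passed to kube-scheduler with --config; the weights shown are illustrative):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated   # pack nodes instead of spreading
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
```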
Phase 3: Selection and binding
The scheduler picks the highest-scoring node. On ties, it picks randomly among the tied nodes.
Then it performs the binding: a write to the Kubernetes API that updates the Pod's spec.nodeName. The kubelet on the destination node detects the change and starts the containers.
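After binding, you can read the decision straight from the Pod object:

```shell
kubectl get pod <pod-name> -o jsonpath='{.spec.nodeName}'
```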
Preemption: when there's no room
If no node passes filtering, the Pod stays Pending. The scheduler can then attempt preemption: find lower-priority Pods on some node, evict them, and free space for the pending Pod.
This requires the Pod to have a defined PriorityClass:
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
```

Note that value is a top-level field of PriorityClass, not part of a spec. The Pod then references the class in its own spec:

```yaml
spec:
  priorityClassName: high-priority
```
A Pod with value: 1000000 can evict Pods with lower values. Pods without priorityClassName have priority 0.
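If you want high scheduling priority without evicting anyone, PriorityClass also supports preemptionPolicy: Never, sketched here:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 1000000
preemptionPolicy: Never   # queues ahead of lower priorities, never preempts
globalDefault: false
```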
Practical diagnosis
If a Pod is Pending, the first place to look:
```shell
kubectl describe pod <name>
```
The Events section says exactly which plugin failed and why:
```
Warning  FailedScheduling  0/3 nodes are available:
  1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: },
  2 node(s) didn't match Pod's node affinity/selector.
```
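To spot scheduling failures across the whole cluster rather than one Pod at a time, you can filter events by reason:

```shell
kubectl get events --all-namespaces --field-selector reason=FailedScheduling
```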
There's also the scheduler extender and the scheduling framework's plugin API for advanced customization, but that's material for another article.
Summary
The kube-scheduler makes a two-phase decision: first it eliminates nodes that cannot run the Pod (filtering), then it picks the best among the remaining ones (scoring). Preemption exists as a last resort for high-priority Pods.
Understanding these phases lets you debug Pods stuck in Pending, design node topologies with intent, and write manifests that express exactly where your workload should run.