Pod Lifecycle and Scheduling Policies

1. Pod state machine

Pending → Running → Succeeded
                  ↘ Failed

Unknown (can be reported from any phase)
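Phase      Meaning
Pending    Accepted by the cluster but not yet running (being scheduled or pulling images)
Running    Bound to a node, with at least one container running, starting, or restarting
Succeeded  All containers exited successfully and will not be restarted (typical for Jobs)
Failed     All containers have terminated and at least one exited with a failure
Unknown    Pod state cannot be obtained, usually because the kubelet on its node is unreachable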

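The STATUS column of kubectl get pods also shows values that are not phases:

  • CrashLoopBackOff: the container keeps crashing, and Kubernetes restarts it with an increasing back-off delay
  • ImagePullBackOff: pulling the image failed, and the kubelet is backing off before retrying
  • Init:0/2: the init containers have not finished yet (0 of 2 completed)
  • Terminating: the Pod is being deleted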
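
To make the CrashLoopBackOff entry concrete, here is a minimal sketch of the restart delay. It assumes the kubelet's documented defaults (10 s base delay, doubling after each restart, capped at 300 s); the function name is illustrative and not part of any Kubernetes API.

```javascript
// Sketch of the restart back-off behind CrashLoopBackOff.
// Assumption: 10 s base delay, doubling per restart, capped at 300 s.
function crashLoopDelaySeconds(restartCount) {
  const baseSeconds = 10
  const capSeconds = 300
  return Math.min(baseSeconds * 2 ** restartCount, capSeconds)
}

// Delays before the first six restarts.
const delays = [0, 1, 2, 3, 4, 5].map(crashLoopDelaySeconds)
console.log(delays.join(', ')) // 10, 20, 40, 80, 160, 300
```

Once the cap is reached, a crashing container is retried about every five minutes, which is why a Pod can sit in CrashLoopBackOff for a long time without appearing to do anything.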

2. Container lifecycle

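[init container 1] → [init container 2] → ...

[main container starts]

[postStart hook] ← asynchronous, runs alongside the container entrypoint

[readinessProbe passes] → Pod is added to the Service endpoints and receives traffic
[livenessProbe keeps checking]

[Pod deletion requested; the terminationGracePeriodSeconds countdown starts]

[preStop hook] ← synchronous, blocks the next step

[SIGTERM sent to the container process; the kubelet waits out the rest of the grace period]

[SIGKILL once the grace period expires]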

2.1 Probes

spec:
  containers:
  - name: web
    readinessProbe:          # decides whether the Pod receives traffic
      httpGet:
        path: /ready
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 3
      successThreshold: 1
      failureThreshold: 3

    livenessProbe:           # decides whether the container is restarted
      httpGet:
        path: /health
        port: 80
      initialDelaySeconds: 30
      periodSeconds: 30
      failureThreshold: 3

    startupProbe:            # for slow-starting applications
      httpGet:
        path: /health
        port: 80
      failureThreshold: 30
      periodSeconds: 10
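
The probe settings above translate into time budgets. The following sketch computes them, assuming the rough rule initialDelaySeconds + periodSeconds × failureThreshold and ignoring timeoutSeconds; the helper name is hypothetical.

```javascript
// Approximate time a probe tolerates failures before the kubelet acts.
// Defaults mirror the Kubernetes probe defaults (0 s delay, 10 s period, 3 failures).
// Assumption: timeoutSeconds is ignored, so this is an estimate, not an exact bound.
function probeBudgetSeconds({ initialDelaySeconds = 0, periodSeconds = 10, failureThreshold = 3 } = {}) {
  return initialDelaySeconds + periodSeconds * failureThreshold
}

// Values taken from the configuration above.
const startupBudget = probeBudgetSeconds({ failureThreshold: 30, periodSeconds: 10 })
const livenessBudget = probeBudgetSeconds({ initialDelaySeconds: 30, periodSeconds: 30, failureThreshold: 3 })

console.log(startupBudget)  // 300
console.log(livenessBudget) // 120
```

With these numbers the application gets up to about five minutes to start, while a hung container is restarted after roughly 90 seconds of consecutive liveness failures once the initial delay has passed.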

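How the three probes differ:

Probe           Consequence of failure
readinessProbe  Pod is removed from the Service endpoints (no new traffic)
livenessProbe   Container is restarted
startupProbe    Startup only: the other two probes stay disabled until it succeeds; if it never succeeds, the container is killed and handled according to restartPolicy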

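Important: a readiness failure does not trigger a restart. The distinction matters for front-end SSR services: when a downstream dependency is briefly unavailable, the Pod should stop receiving traffic, not restart itself.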
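
One way to keep the two concerns apart on the application side is to back /health and /ready with different checks. This is a minimal sketch, not a full server: the `deps` map and the function names are hypothetical, and only the two paths come from the probe configuration in 2.1.

```javascript
// Liveness: answers only "is this process alive?". It never looks at dependencies.
function livenessStatus() {
  return 200
}

// Readiness: answers "can this Pod serve traffic right now?".
// `deps` is a hypothetical map of dependency name -> boolean health flag.
function readinessStatus(deps) {
  const allUp = Object.values(deps).every(Boolean)
  return allUp ? 200 : 503
}

// Map a probe path to an HTTP status code.
function probeStatus(path, deps) {
  if (path === '/health') return livenessStatus()
  if (path === '/ready') return readinessStatus(deps)
  return 404
}

console.log(probeStatus('/ready', { api: true, cache: false }))  // 503
console.log(probeStatus('/health', { api: true, cache: false })) // 200
```

With a downstream dependency down, the Pod drops out of the Service endpoints (503 on /ready) but keeps passing liveness, so it is not restarted.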

2.2 Probe types

# HTTP
httpGet:
  path: /health
  port: 80

# TCP
tcpSocket:
  port: 80

# Command
exec:
  command: ["sh", "-c", "test -f /tmp/healthy"]

# gRPC (K8s 1.24+)
grpc:
  port: 9000
  service: health

2.3 Hooks

lifecycle:
  postStart:
    exec:
      command: ["sh", "-c", "echo started"]
  preStop:
    exec:
      command: ["sh", "-c", "sleep 10 && nginx -s quit"]

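preStop is the key to graceful shutdown: sleep 5 to 10 seconds first so the Service stops routing traffic to the Pod, then tell the application to stop.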

2.4 Graceful shutdown

spec:
  terminationGracePeriodSeconds: 30   # default is 30s
  containers:
  - name: web
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 5 && nginx -s quit"]

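The application must handle SIGTERM correctly:

// Assumes `server` is an http.Server and closeDbConnections() returns a Promise.
process.on('SIGTERM', () => {
  // Stop accepting new connections; the callback runs once in-flight requests finish.
  server.close(async () => {
    await closeDbConnections()
    process.exit(0)
  })
})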
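
Because the grace period starts counting when deletion begins, time spent in preStop is taken out of what the SIGTERM handler has left. The sketch below does that arithmetic; both function names and the "longest request" input are hypothetical.

```javascript
// Seconds left for the SIGTERM handler after the preStop hook finishes.
// The countdown starts when deletion begins, so preStop time is included in it.
function drainBudgetSeconds(terminationGracePeriodSeconds, preStopSeconds) {
  return Math.max(terminationGracePeriodSeconds - preStopSeconds, 0)
}

// Hypothetical check: does the longest expected request fit in what is left?
function fitsGracePeriod(graceSeconds, preStopSeconds, longestRequestSeconds) {
  return longestRequestSeconds <= drainBudgetSeconds(graceSeconds, preStopSeconds)
}

console.log(drainBudgetSeconds(30, 5))   // 25
console.log(fitsGracePeriod(30, 5, 20))  // true
console.log(fitsGracePeriod(30, 10, 25)) // false
```

If the check fails, either raise terminationGracePeriodSeconds or shorten the preStop sleep; otherwise the container is killed with SIGKILL while requests are still in flight.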

3. Init containers

Tasks that must finish before the main containers start:

spec:
  initContainers:
  - name: wait-for-db
    image: busybox
    command: ["sh", "-c", "until nc -z db 5432; do sleep 1; done"]
  - name: migrate
    image: myapp:v1
    command: ["npm", "run", "migrate"]
  containers:
  - name: web
    image: myapp:v1

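Init containers run in order, and the main containers start only after all of them succeed. If one fails, the kubelet retries it until it succeeds, unless restartPolicy is Never, in which case the Pod is marked Failed.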

4. Scheduling policies

4.1 nodeSelector (simplest)

spec:
  nodeSelector:
    disktype: ssd
    region: cn-hangzhou

The node must carry the matching labels:

kubectl label nodes node-1 disktype=ssd

4.2 affinity / anti-affinity

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-type
            operator: In
            values: [frontend]

    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: my-frontend
          topologyKey: kubernetes.io/hostname

required is a hard requirement; preferred is a soft preference. podAntiAffinity spreads the application's own Pods across different nodes for high availability.

4.3 topologySpreadConstraints (recommended)

More intuitive than affinity:

spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-frontend
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: my-frontend

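The first constraint is hard: the number of matching Pods on any two nodes may differ by at most 1. The second is a soft preference for spreading evenly across availability zones.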
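
maxSkew limits the difference between domains, not the Pod count per domain. The sketch below models the scheduler's check in simplified form, assuming every domain is eligible and that skew is computed as (Pods in the target domain after placement) minus (minimum across domains); `canPlace` is an illustrative name, not a Kubernetes API.

```javascript
// counts: matching Pods per topology domain (node or zone), e.g. { 'node-a': 2, 'node-b': 1 }.
// Returns true if one more Pod may land in `domain` without exceeding maxSkew.
function canPlace(counts, domain, maxSkew) {
  const globalMin = Math.min(...Object.values(counts))
  const skewAfterPlacement = counts[domain] + 1 - globalMin
  return skewAfterPlacement <= maxSkew
}

const perNode = { 'node-a': 2, 'node-b': 1, 'node-c': 1 }
console.log(canPlace(perNode, 'node-a', 1)) // false: 3 vs 1 would be a skew of 2
console.log(canPlace(perNode, 'node-b', 1)) // true: 2 vs 1 is a skew of 1
```

Note that node-a already holds two Pods while still satisfying maxSkew: 1, because the other nodes hold one each.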

4.4 taint / toleration

Taint a node so that it accepts only Pods that tolerate the taint:

# Node
kubectl taint nodes gpu-node-1 gpu=true:NoSchedule
# Pod
tolerations:
- key: gpu
  operator: Equal
  value: "true"
  effect: NoSchedule

Commonly used for GPU nodes and control-plane (master) nodes.
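
The matching rule between a toleration and a taint can be sketched as follows. This is a simplified model of the documented semantics (Equal is the default operator, Exists ignores the value, an empty effect matches any effect); the function names are illustrative and tolerationSeconds is not modeled.

```javascript
// Does a single toleration match a single taint?
function tolerates(toleration, taint) {
  // An empty effect on the toleration matches every effect.
  if (toleration.effect && toleration.effect !== taint.effect) return false
  const operator = toleration.operator || 'Equal' // Equal is the default operator
  if (operator === 'Exists') {
    // Exists ignores the value; an empty key with Exists matches every taint.
    return !toleration.key || toleration.key === taint.key
  }
  return toleration.key === taint.key && toleration.value === taint.value
}

// A Pod can be scheduled only if every hard taint is tolerated.
// PreferNoSchedule is a soft preference, so it is skipped here.
function schedulable(tolerations, taints) {
  return taints
    .filter((taint) => taint.effect !== 'PreferNoSchedule')
    .every((taint) => tolerations.some((toleration) => tolerates(toleration, taint)))
}

const gpuTaint = { key: 'gpu', value: 'true', effect: 'NoSchedule' }
const gpuToleration = { key: 'gpu', operator: 'Equal', value: 'true', effect: 'NoSchedule' }
console.log(schedulable([gpuToleration], [gpuTaint])) // true
console.log(schedulable([], [gpuTaint]))              // false
```

A toleration only permits scheduling onto the tainted node; it does not attract the Pod there. To pin GPU workloads to GPU nodes, combine it with nodeSelector or nodeAffinity.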

5. Resource requests and limits

resources:
  requests:
    cpu: 100m        # 0.1 core (reserved)
    memory: 128Mi
  limits:
    cpu: 500m        # 0.5 core (maximum)
    memory: 256Mi
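
Concept   Role
requests  Basis for scheduling: a node must have at least this much unreserved capacity to accept the Pod
limits    Runtime cap: CPU above the limit is throttled; memory above the limit gets the container OOM-killed

A Pod with no requests is treated as reserving nothing, so it is among the first to be evicted when an overcommitted node runs short of resources.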
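
To show how the quantities above are read, here is a small sketch that converts them to plain numbers and performs the scheduler-style "does it fit" comparison. It handles only the `m` CPU suffix and the Ki/Mi/Gi memory suffixes; all function names are hypothetical.

```javascript
// Convert a CPU quantity such as "100m" or "0.5" to cores.
function parseCpu(quantity) {
  return quantity.endsWith('m') ? Number(quantity.slice(0, -1)) / 1000 : Number(quantity)
}

// Convert a memory quantity with a binary suffix (Ki, Mi, Gi) to bytes.
// Only these three suffixes are handled in this sketch.
function parseMemory(quantity) {
  const units = { Ki: 1024, Mi: 1024 ** 2, Gi: 1024 ** 3 }
  const suffix = quantity.slice(-2)
  return suffix in units ? Number(quantity.slice(0, -2)) * units[suffix] : Number(quantity)
}

// A node fits a Pod only if its unreserved capacity covers the Pod's requests.
function fits(freeCpu, freeMemoryBytes, requests) {
  return parseCpu(requests.cpu) <= freeCpu && parseMemory(requests.memory) <= freeMemoryBytes
}

console.log(parseCpu('100m'))     // 0.1
console.log(parseMemory('128Mi')) // 134217728
console.log(fits(0.05, 1024 ** 3, { cpu: '100m', memory: '128Mi' })) // false
```

The comparison uses requests only. Limits play no part in placement, which is why a node can be overcommitted on limits while every Pod still "fits".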

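5.1 QoS classes

Guaranteed: every container sets requests = limits for both CPU and memory ← highest, evicted last
Burstable: at least one request or limit is set, but the Pod is not Guaranteed ← middle
BestEffort: no requests or limits on any container ← lowest, evicted first

Use Guaranteed for critical production services.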
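
The classification rules can be written down as a short function. This is a sketch of the documented rules rather than the kubelet's code; it compares quantity strings literally (so "500m" and "0.5" would count as different), and the input shape is an assumption.

```javascript
// containers: [{ requests: { cpu, memory }, limits: { cpu, memory } }, ...]
function qosClass(containers) {
  const resources = ['cpu', 'memory']

  // BestEffort: nothing is set on any container.
  const anySet = containers.some((c) =>
    resources.some((r) => c.requests?.[r] || c.limits?.[r]))
  if (!anySet) return 'BestEffort'

  // Guaranteed: every container has a limit for both resources and request === limit.
  const guaranteed = containers.every((c) =>
    resources.every((r) => {
      const limit = c.limits?.[r]
      const request = c.requests?.[r] ?? limit // an unset request defaults to the limit
      return limit !== undefined && request === limit
    }))
  return guaranteed ? 'Guaranteed' : 'Burstable'
}

console.log(qosClass([{}])) // BestEffort
console.log(qosClass([{
  requests: { cpu: '100m', memory: '128Mi' },
  limits: { cpu: '500m', memory: '256Mi' },
}])) // Burstable
console.log(qosClass([{ limits: { cpu: '500m', memory: '256Mi' } }])) // Guaranteed
```

The resources block shown in section 5 therefore yields a Burstable Pod; to make it Guaranteed, set the requests equal to the limits.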

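6. Eviction and preemption

When a node runs short of resources, the kubelet evicts Pods. The typical order is:

  1. BestEffort Pods, which have no requests and therefore always exceed them
  2. Burstable Pods whose usage exceeds their requests (largest overshoot first)
  3. Guaranteed Pods, and Burstable Pods within their requests (last)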
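
The kubelet's documented ranking uses three signals in turn: whether usage exceeds requests, Pod priority, and how far usage exceeds requests. The comparator below is a simplified sketch of that ranking for a single resource; the Pod objects and their field names are made up for illustration.

```javascript
// pods: [{ name, usage, request, priority }], usage and request in the same unit (e.g. MiB).
// Returns a new array ordered from "evicted first" to "evicted last".
function evictionOrder(pods) {
  return [...pods].sort((a, b) => {
    const aOver = a.usage > a.request
    const bOver = b.usage > b.request
    if (aOver !== bOver) return aOver ? -1 : 1                    // over-request Pods go first
    if (a.priority !== b.priority) return a.priority - b.priority // then lower priority
    return (b.usage - b.request) - (a.usage - a.request)          // then larger overshoot
  })
}

const ranked = evictionOrder([
  { name: 'guaranteed', usage: 200, request: 256, priority: 0 },
  { name: 'burstable-over', usage: 300, request: 128, priority: 0 },
  { name: 'best-effort', usage: 200, request: 0, priority: 0 },
])
console.log(ranked.map((pod) => pod.name).join(' > '))
// best-effort > burstable-over > guaranteed
```

A BestEffort Pod has a request of 0, so any usage counts as exceeding it. That is how it ends up near the front of the queue without QoS class being checked directly.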

A PriorityClass lets important Pods preempt lower-priority Pods when the scheduler cannot otherwise find room for them:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false

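7. Restart policy

spec:
  restartPolicy: Always # Always | OnFailure | Never

  • Pods managed by a workload controller such as a Deployment must use Always
  • Jobs typically use OnFailure
  • One-off tasks that should not be retried in place use Never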
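
The three policies reduce to a simple decision on the container's exit code. A minimal sketch, with an illustrative function name:

```javascript
// Decide whether the kubelet restarts a container that has exited.
function shouldRestart(restartPolicy, exitCode) {
  switch (restartPolicy) {
    case 'Always': return true               // restart even after a clean exit
    case 'OnFailure': return exitCode !== 0  // restart only after a failure
    case 'Never': return false
    default: throw new Error(`unknown restartPolicy: ${restartPolicy}`)
  }
}

console.log(shouldRestart('Always', 0))    // true
console.log(shouldRestart('OnFailure', 0)) // false
console.log(shouldRestart('OnFailure', 1)) // true
console.log(shouldRestart('Never', 1))     // false
```

The first line is why Always does not suit run-to-completion work: a container that exits with code 0 would simply be started again.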

8. Troubleshooting

# Why is the Pod Pending?
kubectl describe pod <pod>
# Check the Events section: FailedScheduling, ImagePullBackOff

# CrashLoopBackOff
kubectl logs <pod> --previous # logs from the previous (crashed) container
kubectl describe pod <pod> # exit code

# Repeated restarts caused by liveness failures
kubectl describe pod <pod> | grep -A 5 "Last State"
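
The exit code shown under Last State is easier to act on once decoded. The helper below is a sketch based on the common Unix convention that codes above 128 mean "terminated by signal (code - 128)"; it is a hypothetical utility, not part of kubectl.

```javascript
// Interpret a container exit code taken from `kubectl describe pod`.
// Assumption: codes above 128 follow the shell convention 128 + signal number.
function describeExitCode(code) {
  const signals = { 9: 'SIGKILL', 15: 'SIGTERM' }
  if (code === 0) return 'exited normally'
  if (code > 128) {
    const signal = code - 128
    return `killed by signal ${signal}${signals[signal] ? ` (${signals[signal]})` : ''}`
  }
  return `application error (exit code ${code})`
}

console.log(describeExitCode(137)) // killed by signal 9 (SIGKILL)
console.log(describeExitCode(143)) // killed by signal 15 (SIGTERM)
console.log(describeExitCode(1))   // application error (exit code 1)
```

Exit code 137 together with the reason OOMKilled points at the memory limit; 137 without that reason often means the container ignored SIGTERM and was force-killed after the grace period or a failed liveness probe.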

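9. Common anti-patterns

  • No readinessProbe: the Pod receives traffic as soon as it starts, before the application is ready
  • Liveness check that depends on external services: a brief DB outage makes the Pod restart repeatedly
  • terminationGracePeriodSeconds too short: long-running requests are cut off
  • No preStop sleep: the application stops before traffic is drained, and in-flight requests get 502
  • Memory limits set too low (for example requests = limits sized below real usage): frequent OOM kills
  • No podAntiAffinity / topologySpread: all Pods land on one node, and losing that node takes everything down
  • Same endpoint for liveness and readiness: a dependency failure triggers cascading restarts
  • Non-idempotent init container: the migration script breaks when it runs twice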
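
For the last item, one common way to make a migration step safe to re-run is to record which migrations have been applied and skip them next time. The sketch below keeps that record in an in-memory Set purely for illustration; a real setup would store it in the database, and the migration name is made up.

```javascript
// applied: Set of migration names already recorded (in practice, a table in the database).
// migrations: ordered list of { name, up }, where up() performs the change.
// Returns the names that were actually run this time.
function runMigrations(applied, migrations) {
  const ran = []
  for (const migration of migrations) {
    if (applied.has(migration.name)) continue // already done: skip on re-run
    migration.up()
    applied.add(migration.name)
    ran.push(migration.name)
  }
  return ran
}

let calls = 0
const migrations = [{ name: '001_create_users', up: () => { calls += 1 } }]
const applied = new Set()

console.log(runMigrations(applied, migrations)) // [ '001_create_users' ]
console.log(runMigrations(applied, migrations)) // []
console.log(calls) // 1
```

Since an init container can run again whenever the Pod restarts or is rescheduled, the second call returning an empty list is the property that matters.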

10. Further reading