Pod Lifecycle and Scheduling Policies

1. Pod state machine

Pending → Running → Succeeded
                 ↘
                  Failed
                 ↘
                  Unknown

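Phase	Meaning
Pending	Accepted by the cluster but not yet running (waiting to be scheduled or pulling images)
Running	Bound to a node, with at least one container running, starting, or restarting
Succeeded	All containers exited successfully and will not be restarted (typical for Jobs)
Failed	All containers have terminated and at least one exited with a failure
Unknown	Pod state cannot be obtained, usually because the node's kubelet is unreachable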

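The STATUS column of kubectl get pods also shows values that are not Pod phases:

CrashLoopBackOff: a container keeps crashing, and K8s restarts it with an increasing back-off delay
ImagePullBackOff: the image pull failed and is being retried with back-off
Init:0/2: init containers have not finished yet (0 of 2 completed)
Terminating: the Pod is being deleted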
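
To make the back-off behind CrashLoopBackOff concrete, here is a minimal sketch of the delay sequence. It assumes the kubelet defaults described in the upstream documentation (10 s initial delay, doubling on each restart, capped at 300 s); crashLoopDelay is a hypothetical helper name, not a Kubernetes API, and real timings are only approximate.

```javascript
// Sketch of the kubelet restart back-off: 10s, 20s, 40s, ... capped at 5 minutes.
// The constants are the documented defaults; the function name is illustrative only.
const BASE_DELAY_SECONDS = 10
const MAX_DELAY_SECONDS = 300

function crashLoopDelay(restartCount) {
  // restartCount = number of back-off restarts already performed (0 for the first one)
  return Math.min(BASE_DELAY_SECONDS * 2 ** restartCount, MAX_DELAY_SECONDS)
}

const delays = [0, 1, 2, 3, 4, 5, 6].map(crashLoopDelay)
console.log(delays.join(', '))
```

After six restarts the delay stays at the cap, which is why a crashing Pod appears to "hang" for minutes between attempts.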

2. Container lifecycle

[init container 1] → [init container 2] → ...
   ↓
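[main container starts]
   ↓
[postStart hook]   ← runs concurrently with the entrypoint; the container is not marked Running until it returns
   ↓
[readinessProbe passes] → Pod is added to Service endpoints
[livenessProbe keeps checking]
   ↓
[Pod deletion requested; the terminationGracePeriodSeconds countdown starts]
   ↓
[preStop hook]     ← synchronous; SIGTERM is not sent until it returns
   ↓
[SIGTERM sent to the container process; wait for the rest of the grace period]
   ↓
[on timeout: force kill with SIGKILL]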

2.1 Probes

spec:
  containers:
    - name: web
      readinessProbe:        # decides whether the Pod receives traffic
        httpGet:
          path: /ready
          port: 80
        initialDelaySeconds: 5
        periodSeconds: 10
        timeoutSeconds: 3
        successThreshold: 1
        failureThreshold: 3

      livenessProbe:         # decides whether the container is restarted
        httpGet:
          path: /health
          port: 80
        initialDelaySeconds: 30
        periodSeconds: 30
        failureThreshold: 3

      startupProbe:          # for slow-starting applications
        httpGet:
          path: /health
          port: 80
        failureThreshold: 30
        periodSeconds: 10
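
The probe settings above translate into time budgets. The sketch below estimates them with a hypothetical helper (probeBudgetSeconds is not a Kubernetes API); it is an approximation that ignores timeoutSeconds and scheduling jitter.

```javascript
// Rough time bound implied by probe settings (illustrative helper only).
// A probe gives up after failureThreshold consecutive failures spaced periodSeconds apart.
function probeBudgetSeconds({ initialDelaySeconds = 0, periodSeconds = 10, failureThreshold = 3 }) {
  return initialDelaySeconds + periodSeconds * failureThreshold
}

// Values copied from the manifest above
const startupBudget = probeBudgetSeconds({ periodSeconds: 10, failureThreshold: 30 })
// For an already running container the initial delay no longer applies
const livenessBudget = probeBudgetSeconds({ periodSeconds: 30, failureThreshold: 3 })

console.log(`startup window: about ${startupBudget}s`)
console.log(`time to restart a hung container: about ${livenessBudget}s`)
```

With these numbers the application gets roughly 300 s to start, and a hung container is restarted after roughly 90 s.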

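How the three probes differ:

Probe	Consequence of failure
readinessProbe	Pod is removed from Service endpoints (no new traffic)
livenessProbe	Container is restarted
startupProbe	Startup only: the other two probes are disabled until it passes; if it exhausts failureThreshold, the container is killed and restarted per restartPolicy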

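Important: a readiness failure does not trigger a restart. This distinction matters for frontend SSR: when a downstream dependency is briefly unavailable, the Pod should stop receiving traffic, not restart itself.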
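
One way to keep the two concerns apart in a Node.js SSR server is to give each probe its own endpoint. This is a minimal sketch, assuming the /health and /ready paths from the manifest; createHealthHandler and checkDownstream are hypothetical names introduced for illustration.

```javascript
// Sketch: separate endpoints so a downstream outage fails readiness but never liveness.
// checkDownstream is a hypothetical async function supplied by the application.
const http = require('http')

function createHealthHandler(checkDownstream) {
  return async function handler(req, res) {
    if (req.url === '/health') {
      // Liveness: proves only that this process can answer; no external calls
      res.writeHead(200)
      res.end('ok')
      return
    }
    if (req.url === '/ready') {
      // Readiness: a failed dependency removes the Pod from the Service instead of restarting it
      const ok = await checkDownstream().then(() => true, () => false)
      res.writeHead(ok ? 200 : 503)
      res.end(ok ? 'ready' : 'not ready')
      return
    }
    res.writeHead(404)
    res.end()
  }
}

// Not listening yet; call server.listen(80) to serve the probes
const server = http.createServer(createHealthHandler(async () => {}))
```

Because /health never touches the dependency, a database outage can only take the Pod out of rotation; it cannot put it into a restart loop.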

2.2 Probe types

# HTTP
httpGet:
  path: /health
  port: 80

# TCP
tcpSocket:
  port: 80

# Command (exit code 0 means success)
exec:
  command: ["sh", "-c", "test -f /tmp/healthy"]

# gRPC (K8s 1.24+)
grpc:
  port: 9000
  service: health

2.3 hook

lifecycle:
  postStart:
    exec:
      command: ["sh", "-c", "echo started"]
  preStop:
    exec:
      command: ["sh", "-c", "sleep 10 && nginx -s quit"]

preStop is the key to graceful shutdown: sleep 5-10 seconds first so the Pod's removal from Service endpoints can propagate, then stop the application.

2.4 Graceful shutdown

spec:
  terminationGracePeriodSeconds: 30   # default is 30s
  containers:
    - name: web
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 5 && nginx -s quit"]

The application must handle SIGTERM correctly:

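process.on('SIGTERM', async () => {
  // Stop accepting new connections and wait until in-flight requests finish
  await new Promise((resolve) => server.close(resolve))
  // Release external resources only after the server has drained
  await closeDbConnections()
  process.exit(0)
})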
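
The grace period has to cover both the preStop hook and the time the application needs to drain. A small sketch of that arithmetic follows; shutdownBudget is a hypothetical helper, and the 20 s and 30 s drain times are assumed values, not measurements.

```javascript
// Sketch: check that preStop + drain time fits inside terminationGracePeriodSeconds.
// The countdown starts when deletion is requested, so preStop time counts against it.
function shutdownBudget({ gracePeriodSeconds = 30, preStopSeconds, drainSeconds }) {
  const remaining = gracePeriodSeconds - preStopSeconds - drainSeconds
  return { remaining, fits: remaining >= 0 }
}

// 5s preStop sleep (from the manifest above) plus an assumed 20s to drain long requests
const comfortable = shutdownBudget({ preStopSeconds: 5, drainSeconds: 20 })
// An assumed 30s drain no longer fits in the default 30s grace period
const tooTight = shutdownBudget({ preStopSeconds: 5, drainSeconds: 30 })

console.log(comfortable, tooTight)
```

When fits is false, the container would be killed with SIGKILL mid-drain, so either the grace period should be raised or the drain shortened.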

3. init container

Tasks that must complete before the main containers start:

spec:
  initContainers:
    - name: wait-for-db
      image: busybox
      command: ["sh", "-c", "until nc -z db 5432; do sleep 1; done"]
    - name: migrate
      image: myapp:v1
      command: ["npm", "run", "migrate"]
  containers:
    - name: web
      image: myapp:v1

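Init containers run in order, and the main containers start only after all of them succeed. If one fails, the kubelet retries it according to restartPolicy; with restartPolicy: Never the whole Pod is marked Failed.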
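
The ordering rule can be modelled in a few lines. This is only a toy model of the semantics, not kubelet code: runInitSteps and the step objects are hypothetical, and retries are left out.

```javascript
// Sketch of init container semantics: run steps in order, stop at the first failure.
// Each step is a hypothetical { name, run } pair where run() returns an exit code.
function runInitSteps(steps) {
  for (const step of steps) {
    const exitCode = step.run()
    if (exitCode !== 0) {
      // Later steps and the main containers never start
      return { ok: false, failedAt: step.name }
    }
  }
  return { ok: true, failedAt: null }
}

const result = runInitSteps([
  { name: 'wait-for-db', run: () => 0 },
  { name: 'migrate', run: () => 1 },
])
console.log(result)
```

Because a failed step is retried from the start of that container, the work inside each init container should be safe to run more than once.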

4. Scheduling policies

4.1 nodeSelector (simplest)

spec:
  nodeSelector:
    disktype: ssd
    region: cn-hangzhou

The node must carry the matching labels:

kubectl label nodes node-1 disktype=ssd

4.2 affinity / anti-affinity

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-type
                operator: In
                values: [frontend]

    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: my-frontend
            topologyKey: kubernetes.io/hostname

required is a hard requirement; preferred is a soft preference. podAntiAffinity spreads the application's own Pods across different nodes for high availability.
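
To show how a matchExpressions list is evaluated against node labels, here is a simplified sketch. The function names are hypothetical, and only the In, NotIn, Exists, and DoesNotExist operators are modelled (Gt and Lt are omitted).

```javascript
// Sketch of how one nodeSelectorTerm's matchExpressions is checked against node labels.
function matchesExpression(labels, { key, operator, values = [] }) {
  const has = Object.prototype.hasOwnProperty.call(labels, key)
  switch (operator) {
    case 'In': return has && values.includes(labels[key])
    case 'NotIn': return !has || !values.includes(labels[key])
    case 'Exists': return has
    case 'DoesNotExist': return !has
    default: throw new Error(`unsupported operator: ${operator}`)
  }
}

// All expressions inside one term must match (logical AND)
function matchesTerm(labels, expressions) {
  return expressions.every((expr) => matchesExpression(labels, expr))
}

// The term from the manifest above
const term = [{ key: 'node-type', operator: 'In', values: ['frontend'] }]
console.log(matchesTerm({ 'node-type': 'frontend' }, term))
console.log(matchesTerm({ 'node-type': 'backend' }, term))
```

Only the node labelled node-type=frontend satisfies the term, so with the required rule the Pod cannot be scheduled anywhere else.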

4.3 topologySpreadConstraints (recommended)

More direct than affinity for spreading Pods evenly:

spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-frontend
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: my-frontend

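The first constraint is hard: the number of matching Pods on any two nodes may differ by at most 1 (it does not cap each node at one Pod). The second is a soft preference for spreading evenly across availability zones.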
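
The meaning of maxSkew is easiest to see with numbers. The sketch below follows the documented rule (matching Pods in the candidate domain after placement, minus the global minimum, must not exceed maxSkew); it is a simplification that ignores minDomains and node eligibility, and the function names and node counts are made up.

```javascript
// Sketch of the maxSkew check for one topology key.
// counts maps each domain (here: node name) to its current number of matching Pods;
// domains with no matching Pods must be listed with 0.
function skewIfPlaced(counts, domain) {
  const globalMin = Math.min(...Object.values(counts))
  return counts[domain] + 1 - globalMin
}

function canPlace(counts, domain, maxSkew = 1) {
  return skewIfPlaced(counts, domain) <= maxSkew
}

const perNode = { 'node-1': 2, 'node-2': 1, 'node-3': 1 }
console.log(canPlace(perNode, 'node-2')) // 2/2/1 keeps the skew at 1
console.log(canPlace(perNode, 'node-1')) // 3/1/1 would push the skew to 2
```

With whenUnsatisfiable: DoNotSchedule the second placement is rejected; with ScheduleAnyway it would only be ranked lower.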

4.4 taint / toleration

Taint a node so that it only accepts Pods that tolerate the taint:

# Node
kubectl taint nodes gpu-node-1 gpu=true:NoSchedule

# Pod
tolerations:
  - key: gpu
    operator: Equal
    value: "true"
    effect: NoSchedule

Commonly used for GPU nodes and control-plane (master) nodes.
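
The matching rule between a taint and a toleration can be sketched as follows. This is a simplified model of the documented behaviour (tolerationSeconds and the soft PreferNoSchedule effect are ignored), and the function names are hypothetical.

```javascript
// Sketch of taint/toleration matching, simplified from the documented rules.
function tolerates(toleration, taint) {
  // An empty effect on the toleration matches every effect
  if (toleration.effect && toleration.effect !== taint.effect) return false
  const operator = toleration.operator || 'Equal'
  if (operator === 'Exists') {
    // Exists with no key tolerates every taint
    return !toleration.key || toleration.key === taint.key
  }
  return toleration.key === taint.key && (toleration.value || '') === (taint.value || '')
}

// A Pod can be scheduled only if every hard taint (NoSchedule, NoExecute) is tolerated
function canScheduleOn(taints, tolerations) {
  return taints
    .filter((t) => t.effect === 'NoSchedule' || t.effect === 'NoExecute')
    .every((taint) => tolerations.some((tol) => tolerates(tol, taint)))
}

const gpuTaint = { key: 'gpu', value: 'true', effect: 'NoSchedule' }
const gpuToleration = { key: 'gpu', operator: 'Equal', value: 'true', effect: 'NoSchedule' }
console.log(canScheduleOn([gpuTaint], [gpuToleration]))
console.log(canScheduleOn([gpuTaint], []))
```

A toleration only permits scheduling onto the tainted node; it does not attract the Pod there, so GPU workloads usually combine it with nodeSelector or nodeAffinity.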

5. Resource requests and limits

resources:
  requests:
    cpu: 100m         # 0.1 core (reserved)
    memory: 128Mi
  limits:
    cpu: 500m         # 0.5 core (maximum)
    memory: 256Mi

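Concept	Purpose
requests	Basis for scheduling: the node must have this much unallocated capacity to accept the Pod
limits	Runtime cap: CPU above the limit is throttled; memory above the limit gets the container OOM-killed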

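A Pod with no requests is treated as reserving nothing, so it is among the first to be evicted when an overcommitted node runs short of resources.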
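
For readers unsure about the units, this sketch converts the quantity strings from the manifest into plain numbers. parseCpu and parseMemory are hypothetical helpers that handle only the suffixes shown; real Kubernetes quantities accept more forms.

```javascript
// Sketch: convert the quantity strings used in the manifest into plain numbers.
function parseCpu(quantity) {
  // "100m" means 100 millicores = 0.1 core
  return quantity.endsWith('m') ? Number(quantity.slice(0, -1)) / 1000 : Number(quantity)
}

// Binary suffixes: Mi is 2^20 bytes, not 10^6
const BINARY_UNITS = { Ki: 2 ** 10, Mi: 2 ** 20, Gi: 2 ** 30 }

function parseMemory(quantity) {
  const match = /^(\d+)(Ki|Mi|Gi)?$/.exec(quantity)
  if (!match) throw new Error(`unsupported quantity: ${quantity}`)
  return Number(match[1]) * (match[2] ? BINARY_UNITS[match[2]] : 1)
}

console.log(parseCpu('100m'), parseCpu('500m'))
console.log(parseMemory('128Mi'), parseMemory('256Mi'))
```

So the container above reserves 0.1 core and 134217728 bytes, and may burst to 0.5 core and 268435456 bytes.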

5.1 QoS classes

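Guaranteed: every container sets CPU and memory with requests = limits   ← highest
Burstable: at least one request or limit is set, but not Guaranteed      ← middle
BestEffort: no requests or limits at all                                 ← lowest, evicted first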

Use Guaranteed for critical production services.
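
The classification can be expressed as a short function. This is a sketch of the documented rules, not the kubelet's implementation: qosClass is a hypothetical name, and quantities are compared as strings, so "500m" and "0.5" would wrongly count as different.

```javascript
// Sketch of QoS classification for a Pod, given each container's resources block.
// Mirrors the Kubernetes default that requests fall back to limits when only limits are set.
function qosClass(containers) {
  const resources = containers.map(({ requests = {}, limits = {} }) => ({
    requests: { ...limits, ...requests },
    limits,
  }))
  const anySet = resources.some((r) => Object.keys(r.requests).length > 0)
  if (!anySet) return 'BestEffort'

  // Guaranteed needs CPU and memory limits on every container, each equal to its request
  const guaranteed = resources.every((r) =>
    ['cpu', 'memory'].every((name) => r.limits[name] && r.requests[name] === r.limits[name])
  )
  return guaranteed ? 'Guaranteed' : 'Burstable'
}

const fromManifest = qosClass([
  { requests: { cpu: '100m', memory: '128Mi' }, limits: { cpu: '500m', memory: '256Mi' } },
])
console.log(fromManifest)
```

The manifest from section 5 therefore lands in Burstable; raising its requests to match its limits would make it Guaranteed.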

6. Eviction and preemption

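When a node runs short of resources, the kubelet evicts Pods. It ranks them by whether usage exceeds requests, then by Pod priority, then by how far usage exceeds requests, which in practice gives roughly this order:

BestEffort
Burstable (those furthest above their requests)
Guaranteed (last)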
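
The ranking can be sketched as a sort. This is a simplified model of the node-pressure eviction order described in the upstream documentation (usage above requests, then Pod priority, then size of the overage); the Pod records and byte values below are invented for illustration.

```javascript
// Sketch of the kubelet's eviction ordering under memory pressure (simplified).
// Pods are hypothetical { name, priority, requestBytes, usageBytes } records.
function evictionOrder(pods) {
  const exceeds = (p) => (p.usageBytes > p.requestBytes ? 1 : 0)
  const overage = (p) => p.usageBytes - p.requestBytes
  return [...pods]
    .sort((a, b) =>
      exceeds(b) - exceeds(a) || // Pods above their requests go first
      a.priority - b.priority || // then lower priority first
      overage(b) - overage(a)    // then the largest overage first
    )
    .map((p) => p.name)
}

const order = evictionOrder([
  { name: 'guaranteed', priority: 0, requestBytes: 200, usageBytes: 150 },
  { name: 'burstable-over', priority: 0, requestBytes: 100, usageBytes: 300 },
  { name: 'besteffort', priority: 0, requestBytes: 0, usageBytes: 300 },
])
console.log(order)
```

A BestEffort Pod has no requests, so any usage counts as overage, which is why it tends to be evicted first.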

A PriorityClass lets important Pods preempt lower-priority Pods during scheduling:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false

7. Restart policy

spec:
  restartPolicy: Always   # Always | OnFailure | Never

Workload controllers such as Deployment only allow Always
Jobs allow OnFailure or Never
Use Never for one-off tasks that must not be retried in place
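
The three policies reduce to a simple decision on the container's exit code. A minimal sketch, with shouldRestart as a hypothetical helper name:

```javascript
// Sketch: does the kubelet restart a container that exited with this code?
function shouldRestart(restartPolicy, exitCode) {
  switch (restartPolicy) {
    case 'Always': return true            // restart even after a clean exit
    case 'OnFailure': return exitCode !== 0
    case 'Never': return false
    default: throw new Error(`unknown restartPolicy: ${restartPolicy}`)
  }
}

console.log(shouldRestart('Always', 0), shouldRestart('OnFailure', 0), shouldRestart('OnFailure', 1))
```

This is why a long-running service under Always comes back even when it exits with code 0, while a Job under OnFailure stops once it succeeds.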

8. Troubleshooting

# Why is the Pod Pending?
kubectl describe pod <pod>
# Check the Events section: FailedScheduling, ImagePullBackOff

# CrashLoopBackOff
kubectl logs <pod> --previous     # logs from the previous (crashed) container
kubectl describe pod <pod>        # exit code

# Repeated restarts caused by liveness failures
kubectl describe pod <pod> | grep -A 5 "Last State"
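
The same information is available in structured form from kubectl get pod with -o json. The sketch below summarises it; the field names follow the Pod status schema, while summariseRestarts and the sample object are made up for illustration.

```javascript
// Sketch: summarise container restarts from the JSON form of a Pod.
function summariseRestarts(pod) {
  return (pod.status.containerStatuses || []).map((c) => {
    // lastState.terminated is present only if the container has exited before
    const last = c.lastState && c.lastState.terminated
    return {
      name: c.name,
      restarts: c.restartCount,
      lastExitCode: last ? last.exitCode : null,
      lastReason: last ? last.reason : null,
    }
  })
}

// Hypothetical sample: exit code 137 (128 + SIGKILL) with reason OOMKilled
const samplePod = {
  status: {
    containerStatuses: [
      { name: 'web', restartCount: 4, lastState: { terminated: { exitCode: 137, reason: 'OOMKilled' } } },
    ],
  },
}
const summary = summariseRestarts(samplePod)
console.log(summary)
```

An OOMKilled reason points at the memory limit, whereas a failing liveness probe shows up in the Pod's Events instead.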

9. Common anti-patterns

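No readinessProbe: the Pod receives traffic as soon as it starts, before the application is ready
Liveness check depends on external services: a brief DB outage makes Pods restart repeatedly
terminationGracePeriodSeconds too short: long-running requests are cut off
No preStop sleep: the application stops before traffic is drained, so in-flight requests return 502
requests = limits set too low: containers are frequently OOM-killed (memory) or throttled (CPU)
No podAntiAffinity / topologySpread: Pods pile onto one node, and they all go down with it
Same endpoint for liveness and readiness: a dependency failure triggers cascading restarts
Non-idempotent init container: a migration script that runs twice causes problems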
