Prometheus-Grafana Monitoring Stack
1. Concepts
Prometheus uses a pull model: each monitored target exposes a /metrics endpoint, and Prometheus scrapes it at a regular interval.
┌──── Application / Node Exporter / kube-state-metrics ────┐
│                     exposes /metrics                     │
└─────────────────────────────┬────────────────────────────┘
                              ↓ scrape
┌──────────────────────────────────────────────────────────┐
│               Prometheus (storage + query)               │
└─────────────────────────────┬────────────────────────────┘
                    ┌─────────┴─────────┐
                    ↓                   ↓
                 Grafana          Alertmanager
              (dashboards)       (alert routing)
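To make the pull model concrete, here is a minimal sketch of a scrape target built only on Node's standard `http` module. The metric name `demo_requests_total` and the helper names are hypothetical, chosen for illustration; a real service would normally use a client library rather than rendering the text format by hand.

```javascript
const http = require('node:http')

// Hypothetical in-memory counter state for this sketch.
const state = { requestsTotal: 0 }

// Render the counter in the Prometheus text exposition format.
function renderMetrics() {
  return [
    '# HELP demo_requests_total Total requests handled.',
    '# TYPE demo_requests_total counter',
    `demo_requests_total ${state.requestsTotal}`,
    '',
  ].join('\n')
}

// Build the HTTP server; the caller decides when and where to listen.
function createMetricsServer() {
  return http.createServer((req, res) => {
    if (req.url === '/metrics') {
      res.writeHead(200, { 'Content-Type': 'text/plain; version=0.0.4; charset=utf-8' })
      res.end(renderMetrics())
      return
    }
    state.requestsTotal += 1 // every non-metrics request bumps the counter
    res.end('ok\n')
  })
}

// Usage (port 8080 is an assumed example value):
// createMetricsServer().listen(8080)
```

Prometheus never receives pushes from this process; it only issues GET /metrics on its own schedule and stores whatever values it reads at that moment.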
2. Data model
http_requests_total{method="GET", status="200"} 12345 @timestamp
└────────┬────────┘└─────────────┬────────────┘ └─┬─┘
    metric name               labels            value
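The four metric types:
| Type | Use |
|---|---|
| Counter | Monotonically increasing (request count, error count) |
| Gauge | Can go up or down (CPU usage, connection count) |
| Histogram | Distribution, stored as buckets (request latency) |
| Summary | Similar to a histogram, but quantiles are computed on the client |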
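The difference between Histogram and Summary is easier to see with numbers. The sketch below is a simplified, hypothetical model (the bucket bounds and function names are made up for illustration): the client only counts observations into cumulative buckets, and the quantile is estimated later by linear interpolation, which is the idea behind server-side quantile estimation for histograms.

```javascript
// Upper bounds in seconds; the last bucket is always +Inf.
const bounds = [0.1, 0.5, 1, Infinity]

// Buckets are cumulative: an observation increments every bucket
// whose upper bound is greater than or equal to the value.
function observeAll(values) {
  const counts = bounds.map(() => 0)
  for (const v of values) {
    bounds.forEach((le, i) => { if (v <= le) counts[i] += 1 })
  }
  return counts
}

// Estimate quantile q (0 < q <= 1) by linear interpolation inside
// the bucket that contains the target rank.
function estimateQuantile(q, counts) {
  const total = counts[counts.length - 1]
  if (total === 0) return NaN
  const rank = q * total
  const i = counts.findIndex((c) => c >= rank)
  // Rank falls into the +Inf bucket: report the highest finite bound.
  if (bounds[i] === Infinity) return bounds[i - 1]
  const lower = i === 0 ? 0 : bounds[i - 1]
  const below = i === 0 ? 0 : counts[i - 1]
  return lower + (bounds[i] - lower) * ((rank - below) / (counts[i] - below))
}

const counts = observeAll([0.05, 0.2, 0.3, 0.7])
console.log(counts) // [ 1, 3, 4, 4 ]
console.log(estimateQuantile(0.5, counts)) // approximately 0.3
```

Because the estimate is interpolated within a bucket, its accuracy depends on how well the bucket bounds bracket the latencies you care about.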
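3. Exposing metrics from an application (Node.js example)
const express = require('express')
const client = require('prom-client')

// Dedicated registry, plus the default process metrics
const register = new client.Registry()
client.collectDefaultMetrics({ register })

// Histogram of request durations in seconds
const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.005, 0.01, 0.05, 0.1, 0.5, 1, 5],
})
register.registerMetric(httpDuration)

const app = express()

// Time every request; labels are filled in once the response has finished
app.use((req, res, next) => {
  const end = httpDuration.startTimer()
  res.on('finish', () => {
    // Use the route pattern, not the raw URL, to keep label cardinality low
    end({ method: req.method, route: req.route?.path || 'unknown', status: res.statusCode })
  })
  next()
})

// Endpoint scraped by Prometheus
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType)
  res.send(await register.metrics())
})

// Port 3000 is an assumed example value
app.listen(3000)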
4. PromQL cheat sheet
# Current value
http_requests_total
http_requests_total{status="500"}
# Rate (per-second average over 5 minutes)
rate(http_requests_total[5m])
# Sum (aggregate by label)
sum by (status) (rate(http_requests_total[5m]))
# Error ratio
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# P95 latency
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# P95 latency, split by route
histogram_quantile(0.95, sum by (le, route) (rate(http_request_duration_seconds_bucket[5m])))
# Day-over-day change in request rate (compare rates, not raw counter values)
sum(rate(http_requests_total[5m])) - sum(rate(http_requests_total[5m] offset 1d))
# Node CPU usage
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
# Memory usage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes
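Most of the queries above wrap a counter in rate(), so it helps to see roughly what that function does with the raw samples. The following is a simplified sketch, not Prometheus's actual implementation: it handles counter resets but omits the extrapolation to the window boundaries that the real function performs. The sample data is invented for illustration.

```javascript
// samples: [{ t: seconds, v: counter value }, ...] ordered by time.
function simpleRate(samples) {
  if (samples.length < 2) return NaN
  let increase = 0
  for (let i = 1; i < samples.length; i++) {
    const prev = samples[i - 1].v
    const cur = samples[i].v
    // A drop means the counter was reset (for example a process restart),
    // so the new value is counted as growth from zero.
    increase += cur >= prev ? cur - prev : cur
  }
  const seconds = samples[samples.length - 1].t - samples[0].t
  return increase / seconds
}

const samples = [
  { t: 0, v: 100 },
  { t: 60, v: 160 },
  { t: 120, v: 20 }, // reset happened between t=60 and t=120
  { t: 180, v: 80 },
]
// Total increase is 60 + 20 + 60 = 140 over 180 seconds.
console.log(simpleRate(samples))
```

This is also why subtracting raw counter values across time is fragile: a restart makes the naive difference negative, while a reset-aware rate stays meaningful.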
5. Recording rules (pre-aggregation)
Write expensive queries as recording rules so they are computed ahead of time, which speeds up Grafana dashboards:
# rules.yml
groups:
  - name: aggregations
    interval: 30s
    rules:
      - record: api:http_requests:rate5m
        expr: sum by (route, status) (rate(http_requests_total[5m]))
      - record: api:http_errors:ratio5m
        expr: |
          sum by (route) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (route) (rate(http_requests_total[5m]))
6. Alerting rules
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: api:http_errors:ratio5m > 0.01
        for: 5m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "{{ $labels.route }} error rate > 1%"
          description: "Current value: {{ $value | humanizePercentage }}"
          runbook: "https://wiki.example.com/runbooks/high-error-rate"
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 1
        for: 10m
        labels:
          severity: warning
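The `for` field in the rules above is what keeps short spikes from paging anyone: the expression has to stay true across consecutive evaluations for the whole duration before the alert fires. A minimal sketch of that state machine, with a hypothetical function name and invented evaluation results:

```javascript
// conditionResults: one boolean per rule evaluation (true = expr returned a result).
// intervalSec: evaluation interval; forSec: the rule's `for` duration.
function alertStates(conditionResults, intervalSec, forSec) {
  let activeSince = null
  return conditionResults.map((isTrue, i) => {
    const now = i * intervalSec
    if (!isTrue) {
      activeSince = null // a single false evaluation resets the pending timer
      return 'inactive'
    }
    if (activeSince === null) activeSince = now
    return now - activeSince >= forSec ? 'firing' : 'pending'
  })
}

console.log(alertStates([true, true, false, true, true, true], 60, 120))
// [ 'pending', 'pending', 'inactive', 'pending', 'pending', 'firing' ]
```

Only the firing state is sent on to Alertmanager for routing, so a longer `for` trades detection speed for fewer false alarms.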
7. Alertmanager routing
route:
  receiver: default
  group_by: [alertname, namespace]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  routes:
    - matchers: [severity="critical"]
      receiver: pager
      continue: true
    - matchers: [team="frontend"]
      receiver: frontend-slack
receivers:
  # Catch-all receiver referenced by the root route; it must be defined
  - name: default
  # Alertmanager also ships a native pagerduty_configs integration
  - name: pager
    webhook_configs:
      - url: 'https://events.pagerduty.com/...'
  - name: frontend-slack
    slack_configs:
      - api_url: 'https://hooks.slack.com/...'
        channel: '#frontend-alerts'
inhibit_rules:
  - source_matchers: [severity="critical"]
    target_matchers: [severity="warning"]
    equal: [alertname, namespace]
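The effect of `continue: true` in the route tree is easiest to check by walking it by hand. The sketch below models the routes from the config above in plain JavaScript; it is a simplification that supports only equality matchers and ignores grouping and inhibition, and the function names are hypothetical.

```javascript
// Simplified route tree modelled on the configuration in this section.
const route = {
  receiver: 'default',
  routes: [
    { match: { severity: 'critical' }, receiver: 'pager', continue: true },
    { match: { team: 'frontend' }, receiver: 'frontend-slack' },
  ],
}

// True when every matcher equals the corresponding alert label.
function matches(match, labels) {
  return Object.entries(match).every(([k, v]) => labels[k] === v)
}

// Return the receivers an alert with the given labels is delivered to.
function resolveReceivers(node, labels) {
  const found = []
  for (const child of node.routes ?? []) {
    if (!matches(child.match, labels)) continue
    found.push(...resolveReceivers(child, labels))
    if (!child.continue) break // first match wins unless `continue: true`
  }
  // No child matched: the alert is handled by this node's own receiver.
  return found.length > 0 ? found : [node.receiver]
}

console.log(resolveReceivers(route, { severity: 'critical', team: 'frontend' }))
// [ 'pager', 'frontend-slack' ]
console.log(resolveReceivers(route, { severity: 'warning', team: 'backend' }))
// [ 'default' ]
```

Without `continue: true` on the first route, a critical frontend alert would stop at `pager` and never reach the frontend Slack channel.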
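8. Grafana
Add Prometheus as a data source: choose the Prometheus type, enter the Prometheus server URL, and save.
8.1 Dashboards
Import from the community:
- Node Exporter Full: ID 1860
- Kubernetes Cluster Monitoring: ID 7249
- NGINX Ingress: ID 9614
- Loki Logs: ID 13639
Or build your own: add a panel → pick a metric → configure the visualization.
8.2 Variables (dynamic filtering)
namespace: label_values(kube_pod_info, namespace)
pod: label_values(kube_pod_info{namespace="$namespace"}, pod)
Selecting a namespace in the dropdown narrows the pod dropdown to that namespace.
8.3 Reusing dashboards
Export and import dashboards as JSON to share them across environments.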
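9. The high-cardinality trap
Every unique combination of label values is one time series. Using userId or orderId as a label can create hundreds of millions of series and drive Prometheus out of memory.
Labels should be low-cardinality dimensions: method, status, route, namespace, pod name (pod name is already medium cardinality, so use it with care).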
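A quick back-of-the-envelope calculation shows why one bad label is enough. The sketch below computes the worst-case series count as the product of distinct values per label; all the cardinalities are invented numbers for illustration.

```javascript
// Upper bound on series count: the product of distinct values per label.
function maxSeries(labelCardinalities) {
  return Object.values(labelCardinalities).reduce((product, n) => product * n, 1)
}

// Hypothetical cardinalities for a typical HTTP metric.
const lowCardinality = { method: 5, status: 10, route: 50 }
console.log(maxSeries(lowCardinality)) // 2500

// Adding a per-user label multiplies every existing series by the number of users.
console.log(maxSeries({ ...lowCardinality, userId: 1_000_000 })) // 2500000000
```

For a histogram the result is multiplied again by the number of buckets, since each bucket is its own series.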
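10. Long-term storage
A single Prometheus server is best suited to short retention (the default is 15 days). Options for long-term storage:
- Thanos: Prometheus plus object storage, with a global query view
- VictoriaMetrics: strong performance; also runs as a single node
- Mimir (Grafana): evolved from Cortex
- Managed services: Alibaba Cloud Prometheus, Amazon Managed Service for Prometheus (AMP)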
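11. Common anti-patterns
- Using userId as a label: cardinality explosion
- No retention configured: the disk fills up
- Alert thresholds picked by guesswork: noisy alerts or missed incidents
- Alerts without a runbook: the on-call engineer does not know what to do
- Every alert marked critical: on-call fatigue
- Grafana without a password: dashboards exposed to anyone
- Prometheus exposed to the public internet: anyone can query it
- No recording rules: expensive queries recompute over the full data set every time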
12. Further reading
- Prometheus official documentation
- PromQL tutorial
- Grafana documentation
- Brian Brazil, "Prometheus: Up & Running"