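Distributed Tracing and OpenTelemetry
1. Concepts
In a distributed system, a single user request passes through N services. Distributed tracing assigns the request a traceId that is carried through every service, forming a tree of spans: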
Browser fetch /order
└─ Gateway [SpanA] 100ms
   └─ Order Svc [SpanB] 90ms
      ├─ User Svc [SpanC] 20ms
      ├─ Stock Svc [SpanD] 30ms
      └─ DB query [SpanE] 35ms
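A powerful troubleshooting aid: for a failing (5xx) or slow request, it shows at a glance which segment is slow, which SQL query is blocking, and which external service is down.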
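To make "which segment is slow" concrete, here is a minimal sketch that computes each span's self time (its duration minus its direct children) for the example tree. The `SpanNode` shape and both helper functions are hypothetical and only for illustration; the nesting is the one implied by the durations in the diagram, and children are assumed to run sequentially.

```typescript
// Hypothetical span shape for illustration; real OTel spans carry more fields.
interface SpanNode {
  name: string;
  durationMs: number;
  children: SpanNode[];
}

// Self time = a span's duration minus the time covered by its direct children.
// Assumes children run one after another (no overlap).
function selfTimeMs(span: SpanNode): number {
  const childTotal = span.children.reduce((sum, c) => sum + c.durationMs, 0);
  return Math.max(0, span.durationMs - childTotal);
}

// Walk the tree and return the span with the largest self time.
function slowestBySelfTime(root: SpanNode): { name: string; selfMs: number } {
  let best = { name: root.name, selfMs: selfTimeMs(root) };
  for (const child of root.children) {
    const candidate = slowestBySelfTime(child);
    if (candidate.selfMs > best.selfMs) best = candidate;
  }
  return best;
}

const leaf = (name: string, durationMs: number): SpanNode => ({
  name,
  durationMs,
  children: [],
});

// The example trace from the diagram.
const exampleTrace: SpanNode = {
  name: "Gateway",
  durationMs: 100,
  children: [
    {
      name: "Order Svc",
      durationMs: 90,
      children: [leaf("User Svc", 20), leaf("Stock Svc", 30), leaf("DB query", 35)],
    },
  ],
};

const slowest = slowestBySelfTime(exampleTrace);
console.log(slowest); // { name: 'DB query', selfMs: 35 }
```

Tracing backends show the same information visually as a waterfall; the point of the sketch is only that the parent/child structure is what makes per-segment attribution possible.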
2. OpenTelemetry (OTel)
A CNCF project that standardizes the SDKs and the Collector for traces, metrics, and logs.
Application (OTel SDK) → OTel Collector → backend (Jaeger / Tempo / Datadog)
3. Frontend setup (browser)
import { WebTracerProvider } from '@opentelemetry/sdk-trace-web'
import { ZoneContextManager } from '@opentelemetry/context-zone'
import { FetchInstrumentation } from '@opentelemetry/instrumentation-fetch'
import { XMLHttpRequestInstrumentation } from '@opentelemetry/instrumentation-xml-http-request'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base'
import { Resource } from '@opentelemetry/resources'
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions'
import { registerInstrumentations } from '@opentelemetry/instrumentation'
// Describe this app to the backend via resource attributes.
const provider = new WebTracerProvider({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'frontend',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
  }),
})
// Batch spans and send them to the Collector over OTLP/HTTP.
provider.addSpanProcessor(
  new BatchSpanProcessor(
    new OTLPTraceExporter({
      url: 'https://otel-collector.example.com/v1/traces',
    })
  )
)
// ZoneContextManager keeps the active span across async callbacks in the browser.
provider.register({ contextManager: new ZoneContextManager() })
registerInstrumentations({
  instrumentations: [
    new FetchInstrumentation({
      // Cross-origin URLs that should also receive the traceparent header.
      propagateTraceHeaderCorsUrls: [/api\.example\.com/],
    }),
    new XMLHttpRequestInstrumentation(),
  ],
})
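This automatically adds a traceparent header to same-origin fetch requests and to cross-origin requests whose URL matches propagateTraceHeaderCorsUrls; the backend reads the header and joins the same trace.
4. Node backend setup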
import { NodeSDK } from '@opentelemetry/sdk-node'
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc'
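const sdk = new NodeSDK({
  serviceName: 'api',
  // Export spans to the Collector over OTLP/gRPC.
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
})
// Start the SDK before the application loads the libraries to be instrumented.
sdk.start()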
Auto-instrumented: HTTP / Express / Koa / Postgres / MySQL / Redis / GraphQL ...
5. W3C Trace Context
The standard header for passing the traceId across services:
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
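Fields, left to right: version (00), trace-id (32 hex characters), parent span-id (16 hex characters), trace-flags (01 = sampled).
Each service creates a new span whose parent ID is the upstream span's ID.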
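The header layout can be sketched as a small parser plus a helper that builds the header a service would send downstream. This is a hand-written illustration of the version-00 format, not the OTel propagator; the function names and the span ID `00f067aa0ba902b7` are made up for the example.

```typescript
interface TraceParent {
  version: string;
  traceId: string;
  parentId: string;
  flags: string;
}

// Lowercase hex string of an exact length.
const HEX = (len: number) => new RegExp(`^[0-9a-f]{${len}}$`);

// Parse a version-00 style traceparent header; returns null if it is malformed.
function parseTraceparent(header: string): TraceParent | null {
  const parts = header.trim().split("-");
  if (parts.length !== 4) return null;
  const [version, traceId, parentId, flags] = parts as [string, string, string, string];
  const wellFormed =
    HEX(2).test(version) && HEX(32).test(traceId) && HEX(16).test(parentId) && HEX(2).test(flags);
  // The spec forbids version "ff" and all-zero trace or parent IDs.
  const allZero = (s: string) => /^0+$/.test(s);
  if (!wellFormed || version === "ff" || allZero(traceId) || allZero(parentId)) return null;
  return { version, traceId, parentId, flags };
}

// Header sent downstream: same trace-id, this service's own span ID as the new parent.
function childTraceparent(incoming: TraceParent, ownSpanId: string): string {
  return `${incoming.version}-${incoming.traceId}-${ownSpanId}-${incoming.flags}`;
}

// Bit 0 of the flags byte is the "sampled" flag.
function isSampled(tp: TraceParent): boolean {
  return (parseInt(tp.flags, 16) & 0x01) === 0x01;
}

const incoming = parseTraceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01");
if (incoming !== null) {
  // "00f067aa0ba902b7" is a made-up span ID for this service.
  console.log(childTraceparent(incoming, "00f067aa0ba902b7"));
  // 00-0af7651916cd43dd8448eb211c80319c-00f067aa0ba902b7-01
  console.log(isSampled(incoming)); // true
}
```

In practice the SDK's instrumentation does this for you; the sketch only shows why the trace-id stays constant while the span-id changes at every hop.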
6. OTel Collector
Centralized collection, processing, and forwarding:
# config.yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }
processors:
  batch: {}
  memory_limiter:
    check_interval: 1s  # memory_limiter needs a check interval
    limit_mib: 512
exporters:
  otlp:
    endpoint: tempo:4317
    tls: { insecure: true }
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
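7. Choosing a backend
| Backend | Characteristics |
|---|---|
| Jaeger | Long-established CNCF project, open source |
| Tempo (Grafana) | Low cost, billed by object storage, Grafana integration |
| Zipkin | Long-established |
| Datadog APM | Full-suite SaaS |
| Alibaba Cloud ARMS | SaaS for the China market |
For small to medium scale, Tempo + Grafana usually offers the best value.
8. Sampling strategies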
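Sampling every trace makes the data volume explode. Common approaches:
- Head-based: the decision is made at the entry point (a `tracesSampler` function)
  - 1% by default + 100% of error requests
  - Sample by a hash of the user ID
- Tail-based (handled in the OTel Collector): decide whether to keep a trace only after the whole trace has been collected
  - Keep slow requests
  - Keep errors
  - Otherwise sample by ratio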
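The two families of decisions above can be sketched as plain functions. This is only the decision logic under stated assumptions: the `fnv1a` hash, the `FinishedTrace` and `TailPolicy` shapes, and the threshold values are hypothetical, and a real tail-based setup would run in the Collector rather than in application code.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units; returns an unsigned integer.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Map a key to a stable bucket in [0, 1).
function bucket(key: string): number {
  return fnv1a(key) / 0x100000000;
}

// Head-based: decide at the entry point, before the outcome is known.
// Hashing the user ID means a given user is either always or never sampled.
function headSampleByUser(userId: string, ratio: number): boolean {
  return bucket(userId) < ratio;
}

// Hypothetical summary of a trace after all of its spans have arrived.
interface FinishedTrace {
  traceId: string;
  durationMs: number;
  hasError: boolean;
}

interface TailPolicy {
  slowThresholdMs: number;
  baselineRatio: number;
}

// Tail-based: keep errors and slow traces, otherwise fall back to a ratio.
function tailKeep(t: FinishedTrace, policy: TailPolicy): boolean {
  if (t.hasError) return true;
  if (t.durationMs >= policy.slowThresholdMs) return true;
  return bucket(t.traceId) < policy.baselineRatio;
}

const policy: TailPolicy = { slowThresholdMs: 1000, baselineRatio: 0.01 };
console.log(headSampleByUser("user-42", 0.01));
console.log(tailKeep({ traceId: "0af7651916cd43dd8448eb211c80319c", durationMs: 1500, hasError: false }, policy)); // true (slow)
```

The trade-off the sketch makes visible: the head-based function needs only the user ID, so it is cheap and can run anywhere, while the tail-based function needs the finished trace, which is why it requires buffering complete traces somewhere central.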
9. Custom business spans
import { trace, SpanStatusCode } from '@opentelemetry/api'
const tracer = trace.getTracer('order-service')
async function processOrder(orderId: string) {
  // startActiveSpan makes this span the active one, so spans started inside become its children.
  return tracer.startActiveSpan('processOrder', { attributes: { orderId } }, async (span) => {
    try {
      const user = await tracer.startActiveSpan('getUser', async (s) => {
        try {
          const u = await db.user.find(...)
          s.setAttribute('user.id', u.id)
          return u
        } finally {
          s.end() // always end the child span, even if the query throws
        }
      })
      // ...
      span.setStatus({ code: SpanStatusCode.OK })
    } catch (err) {
      span.recordException(err as Error)
      span.setStatus({ code: SpanStatusCode.ERROR })
      throw err
    } finally {
      span.end()
    }
  })
}
10. Correlating logs and metrics
Attach the traceId to every log entry:
import { trace } from '@opentelemetry/api'
logger.info({
  traceId: trace.getActiveSpan()?.spanContext().traceId,
  userId,
}, 'user logged in')
In Grafana, you can then jump from a log line straight to the corresponding trace.
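Repeating the traceId lookup at every call site is easy to forget, so one option is to centralize it in a thin logger wrapper. The sketch below is an assumption-laden illustration: `createTraceAwareLogger`, `SpanContextLike`, and the in-memory sink are all hypothetical, and in a real service the `getContext` argument would be something like `() => trace.getActiveSpan()?.spanContext()`.

```typescript
// Minimal stand-in for the span context fields used here (an assumption;
// the real type lives in @opentelemetry/api).
interface SpanContextLike {
  traceId: string;
  spanId: string;
}

type LogFields = Record<string, unknown>;
type ContextGetter = () => SpanContextLike | undefined;
type Sink = (record: LogFields) => void;

// Wrap a log sink so every record carries the current trace and span IDs.
function createTraceAwareLogger(getContext: ContextGetter, sink: Sink) {
  return {
    info(fields: LogFields, message: string): void {
      const ctx = getContext();
      sink({
        level: "info",
        message,
        ...fields,
        // Only add IDs when a span is active, so logs outside a request stay clean.
        ...(ctx ? { traceId: ctx.traceId, spanId: ctx.spanId } : {}),
      });
    },
  };
}

// Usage with an in-memory sink and a fake "active span".
const records: LogFields[] = [];
let active: SpanContextLike | undefined = {
  traceId: "0af7651916cd43dd8448eb211c80319c",
  spanId: "b7ad6b7169203331",
};
const logger = createTraceAwareLogger(() => active, (r) => { records.push(r); });

logger.info({ userId: "u-1" }, "user logged in");
active = undefined; // no active span, e.g. a background job
logger.info({}, "cache warmed");
console.log(records);
```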
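11. Performance overhead
The OTel SDK typically adds less than 5% CPU overhead. In production, avoid tracing every single Redis call, because it produces too many nested spans.
12. Common anti-patterns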
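- Not propagating traceparent: the trace breaks
- Sampling at 100%: storage blows up
- Exporting spans synchronously: blocks business logic
- PII in traces: compliance risk
- Manual spans in every function: too much noise; prefer auto-instrumentation
- No correlation with logs: the trace shows what is slow, but you cannot locate the code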
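For the PII item, one possible mitigation is to scrub attributes before they are attached to a span. The sketch below is a minimal illustration under assumptions: the deny-list pattern and the `scrubAttributes` helper are hypothetical, and matching on key names alone will not catch PII embedded in values.

```typescript
type AttributeValue = string | number | boolean;
type Attributes = Record<string, AttributeValue>;

// Hypothetical deny-list of key fragments; adjust to your own compliance rules.
const SENSITIVE_KEY = /(email|phone|password|token|id_card)/i;

// Return a copy of the attributes with sensitive values replaced.
function scrubAttributes(attrs: Attributes): Attributes {
  const out: Attributes = {};
  for (const [key, value] of Object.entries(attrs)) {
    out[key] = SENSITIVE_KEY.test(key) ? "[REDACTED]" : value;
  }
  return out;
}

const scrubbed = scrubAttributes({
  "user.email": "alice@example.com",
  "auth.token": "abc123",
  "order.id": 42,
});
console.log(scrubbed);
// { 'user.email': '[REDACTED]', 'auth.token': '[REDACTED]', 'order.id': 42 }
```

Calling such a helper at the point where attributes are set keeps sensitive values from ever leaving the process; scrubbing later in the pipeline is also possible but means the data has already been transmitted once.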