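Designing a Monitoring and Alerting System
Question
How would you design a complete monitoring and alerting system in Go, covering metric collection, storage, visualization, and alerting?
Answer
Overall Architecture
Metric Types (the Four Prometheus Types)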
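| Type | Description | Examples |
|---|---|---|
| Counter | A cumulative value that only increases (and resets to zero on restart) | Total requests, error count |
| Gauge | A value that can go up or down | Current connections, goroutine count |
| Histogram | Counts observations into configurable buckets to capture a distribution | Request latency distribution |
| Summary | Computes quantiles on the client | P50/P99 latency |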
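To illustrate why a Counter is only ever read through its increase, here is a minimal sketch of how a drop in the value can be treated as a process restart. It is a simplification of what Prometheus does for rate() and increase() (no extrapolation to the window edges), and the function name and sample values are made up for illustration.

```go
package main

import "fmt"

// counterIncrease returns the total increase across consecutive samples of a
// counter. A drop between two samples is treated as a restart, so the counter
// is assumed to have started again from zero.
func counterIncrease(samples []float64) float64 {
	var total float64
	for i := 1; i < len(samples); i++ {
		prev, cur := samples[i-1], samples[i]
		if cur >= prev {
			total += cur - prev
		} else {
			total += cur // reset: everything counted since the restart
		}
	}
	return total
}

func main() {
	// Hypothetical scrape results; the process restarted between 20 and 3.
	samples := []float64{10, 15, 20, 3, 8}
	fmt.Println(counterIncrease(samples)) // prints 18
}
```

A Gauge has no such rule: a drop is a legitimate value, which is why values that can decrease should not be modeled as Counters.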
Instrumentation
import (
	"runtime"
	"strconv"
	"time"

	"github.com/gin-gonic/gin"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counter: total number of requests
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "path", "status"},
	)

	// Histogram: request latency
	httpRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request latency in seconds",
			Buckets: []float64{0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5},
		},
		[]string{"method", "path"},
	)

	// Gauge: currently active connections
	activeConnections = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "active_connections",
			Help: "Number of currently active connections",
		},
	)

	// Gauge: goroutine count, read on every scrape.
	// The default registry already exposes go_goroutines; this shows the GaugeFunc pattern.
	goroutineCount = promauto.NewGaugeFunc(
		prometheus.GaugeOpts{
			Name: "goroutine_count",
			Help: "Current number of goroutines",
		},
		func() float64 { return float64(runtime.NumGoroutine()) },
	)
)
Automatic Instrumentation with Gin Middleware
func PrometheusMiddleware() gin.HandlerFunc {
	return func(c *gin.Context) {
		start := time.Now()
		// Use the route template (e.g. /api/users/:id) rather than the raw path to avoid high cardinality.
		path := c.FullPath()
		if path == "" {
			path = "unknown" // no route matched (e.g. a 404)
		}

		c.Next()

		duration := time.Since(start).Seconds()
		status := strconv.Itoa(c.Writer.Status())
		httpRequestsTotal.WithLabelValues(c.Request.Method, path, status).Inc()
		httpRequestDuration.WithLabelValues(c.Request.Method, path).Observe(duration)
	}
}

func main() {
	r := gin.New()
	r.Use(PrometheusMiddleware())
	// Expose the /metrics endpoint for Prometheus to scrape
	r.GET("/metrics", gin.WrapH(promhttp.Handler()))
	r.GET("/api/users", getUsers)
	r.Run(":8080")
}
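The Label Cardinality Trap
Never use high-cardinality values such as userID or requestID as Prometheus labels. Every distinct combination of label values creates a separate time series, so such labels cause a series explosion and runaway memory usage. Restrict labels to low-cardinality values (method, status, service, and so on).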
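The arithmetic behind the warning can be sketched in a few lines: the worst-case number of series for a metric is the product of the distinct values of each label. The cardinalities below are hypothetical numbers chosen for illustration, not measurements.

```go
package main

import "fmt"

// seriesCount returns the worst-case number of time series for one metric:
// the product of the number of distinct values of each label.
func seriesCount(labelCardinality map[string]int) int {
	total := 1
	for _, n := range labelCardinality {
		total *= n
	}
	return total
}

func main() {
	// Hypothetical cardinalities for a small service.
	safe := map[string]int{"method": 5, "path": 20, "status": 10}
	fmt.Println(seriesCount(safe)) // prints 1000

	// The same metric with a per-user label added.
	risky := map[string]int{"method": 5, "path": 20, "status": 10, "user_id": 100000}
	fmt.Println(seriesCount(risky)) // prints 100000000
}
```

Histograms multiply this further: a histogram with 8 explicit buckets exposes 11 series per label combination (the 8 buckets, the +Inf bucket, _sum, and _count).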
Custom Business Metrics
// Business metric: orders created
var orderCreated = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "order_created_total",
		Help: "Number of orders created",
	},
	[]string{"channel"}, // app, web, api
)

// Business metric: payment latency
var paymentDuration = promauto.NewHistogram(
	prometheus.HistogramOpts{
		Name:    "payment_duration_seconds",
		Help:    "Time spent processing a payment, in seconds",
		Buckets: prometheus.DefBuckets,
	},
)

func CreateOrder(ctx context.Context, req OrderReq) error {
	orderCreated.WithLabelValues(req.Channel).Inc()

	// Time the payment call whether it succeeds or fails.
	start := time.Now()
	err := processPayment(ctx, req)
	paymentDuration.Observe(time.Since(start).Seconds())
	return err
}
Defining SLIs and SLOs
# Availability SLO: 99.9%
# SLI = successful requests / total requests
- record: sli:availability
  expr: |
    sum(rate(http_requests_total{status!~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))

# Latency SLO: P99 < 500ms
- record: sli:latency_p99
  expr: |
    histogram_quantile(0.99,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
    )
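An SLO is easier to reason about as an error budget. The sketch below works through the arithmetic for the 99.9% availability target above; the 30-day window and the measured SLI of 99.95% are assumptions picked for the example, and the function names are illustrative.

```go
package main

import "fmt"

// errorBudgetSeconds returns how many seconds of full unavailability an SLO
// tolerates over a window of windowSeconds.
func errorBudgetSeconds(slo, windowSeconds float64) float64 {
	return (1 - slo) * windowSeconds
}

// budgetConsumed returns the fraction of the error budget used, given the
// measured SLI (success ratio) over the same window. 1.0 means the budget
// is fully spent.
func budgetConsumed(slo, sli float64) float64 {
	return (1 - sli) / (1 - slo)
}

func main() {
	const window = 30 * 24 * 60 * 60 // assumed 30-day window, in seconds

	// 0.1% of 30 days is 43.2 minutes.
	fmt.Printf("budget: %.1f minutes\n", errorBudgetSeconds(0.999, window)/60)

	// A measured availability of 99.95% has used half of the budget.
	fmt.Printf("consumed: %.0f%%\n", budgetConsumed(0.999, 0.9995)*100)
}
```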
Alerting Rules
# Prometheus alerting rules
groups:
  - name: service-alerts
    rules:
      # Error rate above 1% for 5 minutes
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # Abnormal goroutine count
      - alert: GoroutineLeak
        expr: goroutine_count > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Possible goroutine leak"

      # P99 latency too high
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 0.5
        for: 5m
        labels:
          severity: warning
Delivering Alerts from Go
// Receiver for Alertmanager webhooks
type AlertWebhook struct {
	Status string  `json:"status"` // firing / resolved
	Alerts []Alert `json:"alerts"`
}

type Alert struct {
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
	StartsAt    time.Time         `json:"startsAt"`
}

func HandleAlert(c *gin.Context) {
	var webhook AlertWebhook
	if err := c.ShouldBindJSON(&webhook); err != nil {
		c.JSON(400, gin.H{"error": err.Error()})
		return
	}

	for _, alert := range webhook.Alerts {
		msg := fmt.Sprintf("[%s] %s\n%s",
			alert.Labels["severity"],
			alert.Annotations["summary"],
			alert.Annotations["description"],
		)

		// Choose notification channels by severity
		switch alert.Labels["severity"] {
		case "critical":
			sendDingTalk(msg) // DingTalk + SMS
			sendSMS(msg)
		case "warning":
			sendDingTalk(msg) // DingTalk only
		}
	}

	c.JSON(200, gin.H{"status": "ok"})
}
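To see the routing logic without running a server, the sketch below decodes a sample request body with the standard library and maps each alert's severity to channels. The payload is a hand-written example in the shape Alertmanager sends (trimmed to the fields used here), and the type names, channel names, and channelsFor helper are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// webhookPayload mirrors the subset of the webhook body used in this sketch.
type webhookPayload struct {
	Status string         `json:"status"`
	Alerts []webhookAlert `json:"alerts"`
}

type webhookAlert struct {
	Status      string            `json:"status"`
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
	StartsAt    time.Time         `json:"startsAt"`
}

// channelsFor maps a severity label to notification channels.
// The channel names are placeholders for real senders.
func channelsFor(severity string) []string {
	switch severity {
	case "critical":
		return []string{"dingtalk", "sms"}
	case "warning":
		return []string{"dingtalk"}
	default:
		return nil // unknown severities are not notified
	}
}

// parseWebhook decodes a webhook request body.
func parseWebhook(body []byte) (webhookPayload, error) {
	var p webhookPayload
	err := json.Unmarshal(body, &p)
	return p, err
}

// samplePayload is an illustrative body, not captured from a real system.
const samplePayload = `{
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "HighErrorRate", "severity": "critical"},
      "annotations": {"summary": "High error rate"},
      "startsAt": "2024-01-01T00:00:00Z"
    },
    {
      "status": "firing",
      "labels": {"alertname": "HighLatency", "severity": "warning"},
      "annotations": {},
      "startsAt": "2024-01-01T00:05:00Z"
    }
  ]
}`

func main() {
	p, err := parseWebhook([]byte(samplePayload))
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for _, a := range p.Alerts {
		fmt.Println(a.Status, a.Labels["alertname"], channelsFor(a.Labels["severity"]))
	}
}
```

Each alert carries its own status field, so a handler can distinguish resolved alerts from firing ones inside the same notification.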
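Key Monitoring Dimensions
| Dimension | Metrics |
|---|---|
| RED method | Rate (request rate), Errors (error rate), Duration (latency) |
| USE method | Utilization, Saturation, Errors |
| Runtime | Goroutine count, GC pauses, memory, CPU |
| Infrastructure | Disk, network, connection count |
| Business | Order volume, payment success rate, conversion rate |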
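Common Interview Questions
Q1: How do you choose between Prometheus pull and push?
Answer:
- Pull (the Prometheus default): Prometheus scrapes each target's /metrics endpoint. Suited to long-running services
- Push: the service pushes metrics to a Pushgateway. Suited to batch jobs and other short-lived jobs
- For Go microservices, prefer pull, combined with service discovery to find targets automatically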
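Q2: How do you choose between Histogram and Summary?
Answer:
- Histogram: the client only counts observations into buckets, and quantiles are estimated on the Prometheus server at query time, so buckets can be summed across instances before the quantile is computed; bucket boundaries are fixed in advance
- Summary: the client computes the quantiles itself, so they cannot be aggregated across instances
- Prefer Histogram, because it aggregates correctly in multi-instance deployments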
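The aggregation argument can be made concrete with a simplified quantile estimate over cumulative buckets, using linear interpolation inside the bucket that contains the target rank. This is a sketch of the idea rather than Prometheus's exact histogram_quantile implementation: it assumes finite bucket bounds sorted in ascending order, ignores the +Inf bucket, and uses made-up counts for two instances.

```go
package main

import "fmt"

// bucket is one cumulative histogram bucket: the number of observations
// less than or equal to upperBound.
type bucket struct {
	upperBound float64
	count      float64
}

// sumBuckets adds the counts of two histograms with identical bounds,
// which is what sum(...) by (le) does across instances.
func sumBuckets(a, b []bucket) []bucket {
	out := make([]bucket, len(a))
	for i := range a {
		out[i] = bucket{a[i].upperBound, a[i].count + b[i].count}
	}
	return out
}

// estimateQuantile approximates quantile q (0..1) by interpolating linearly
// inside the first bucket whose cumulative count reaches the target rank.
func estimateQuantile(q float64, buckets []bucket) float64 {
	if len(buckets) == 0 || buckets[len(buckets)-1].count == 0 {
		return 0
	}
	rank := q * buckets[len(buckets)-1].count
	lowerBound, below := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank && b.count > below {
			return lowerBound + (b.upperBound-lowerBound)*(rank-below)/(b.count-below)
		}
		lowerBound, below = b.upperBound, b.count
	}
	return lowerBound
}

func main() {
	// Hypothetical cumulative counts for two instances with the same bounds.
	a := []bucket{{0.1, 50}, {0.25, 80}, {0.5, 95}, {1, 100}}
	b := []bucket{{0.1, 10}, {0.25, 20}, {0.5, 60}, {1, 100}}

	fmt.Printf("instance A P99: %.3f\n", estimateQuantile(0.99, a))
	fmt.Printf("fleet P99:      %.3f\n", estimateQuantile(0.99, sumBuckets(a, b)))
}
```

Bucket counts add up across instances, so the fleet-wide quantile is computed from the summed buckets. Two precomputed Summary quantiles offer no equivalent operation: averaging two P99 values does not yield the P99 of the combined traffic.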
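Q3: How do you avoid an "alert storm"?
Answer:
- Grouping: group_by batches alerts that share the same labels, so each group produces a single notification
- Inhibition: inhibit_rules let a higher-severity alert suppress related lower-severity alerts
- Silences: mute alerts during maintenance windows
- Severity tiers: critical triggers a phone call, warning goes to DingTalk, info is only logged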
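The grouping idea from the list above can be sketched as follows: alerts are bucketed by the values of a chosen set of labels, and each bucket yields one notification. This imitates the concept behind group_by and is not Alertmanager's implementation; the alert type, label values, and key format are invented for the example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// alert is a minimal stand-in for an alert: just its label set.
type alert struct {
	labels map[string]string
}

// groupKey builds a stable key from the values of the groupBy labels.
func groupKey(labels map[string]string, groupBy []string) string {
	parts := make([]string, 0, len(groupBy))
	for _, name := range groupBy {
		parts = append(parts, name+"="+labels[name])
	}
	return strings.Join(parts, ",")
}

// groupAlerts buckets alerts that share the same values for the groupBy labels.
func groupAlerts(alerts []alert, groupBy []string) map[string][]alert {
	groups := make(map[string][]alert)
	for _, a := range alerts {
		k := groupKey(a.labels, groupBy)
		groups[k] = append(groups[k], a)
	}
	return groups
}

func main() {
	// Three firing alerts; two differ only by instance.
	alerts := []alert{
		{map[string]string{"alertname": "HighErrorRate", "service": "order", "instance": "10.0.0.1"}},
		{map[string]string{"alertname": "HighErrorRate", "service": "order", "instance": "10.0.0.2"}},
		{map[string]string{"alertname": "HighLatency", "service": "order", "instance": "10.0.0.1"}},
	}

	groups := groupAlerts(alerts, []string{"alertname", "service"})

	// Sort keys so the output order is deterministic.
	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s: 1 notification covering %d alert(s)\n", k, len(groups[k]))
	}
}
```

Leaving instance out of the grouping labels is what collapses a fleet-wide failure into one message instead of one per machine.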
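Q4: Which Go runtime metrics should you monitor?
Answer:
- Goroutine count (to detect leaks)
- GC pause duration and frequency
- Heap memory in use / number of heap objects
- Thread count
The default registry served by promhttp.Handler() already includes a Go collector, so go_* runtime metrics such as go_goroutines, go_gc_duration_seconds, and go_threads are exposed without extra code.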
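Q5: How do you achieve second-level monitoring?
Answer:
- Prometheus scrape intervals are typically set to 10-15s (the default is 1m). Shorter intervals are configurable, but they multiply sample volume and the load on both the targets and Prometheus
- For second-level resolution, aggregate inside the application and push to a time-series database (InfluxDB / VictoriaMetrics)
- Or use a commercial solution such as the Datadog Agent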
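One way to picture the in-application aggregation step is a ring buffer with one counter per second, which a background task could read and push at whatever interval the backend accepts. The sketch covers only the aggregation; the type name and 60-second retention are assumptions, timestamps are passed in explicitly to keep it deterministic, and it is not safe for concurrent use without a mutex.

```go
package main

import "fmt"

const windowSize = 60 // seconds of history retained (assumed)

// secondCounter keeps one event counter per second in a fixed ring buffer.
// It is not safe for concurrent use; guard it with a mutex in a real service.
type secondCounter struct {
	stamps [windowSize]int64 // unix second each slot currently represents
	counts [windowSize]int64
}

// Add records n events at unix second ts.
func (c *secondCounter) Add(ts, n int64) {
	i := ts % windowSize
	if c.stamps[i] != ts { // slot holds an older second: recycle it
		c.stamps[i] = ts
		c.counts[i] = 0
	}
	c.counts[i] += n
}

// Sum returns the events recorded during the last `seconds` seconds,
// ending at and including unix second now.
func (c *secondCounter) Sum(now, seconds int64) int64 {
	var total int64
	for ts := now - seconds + 1; ts <= now; ts++ {
		if ts < 0 {
			continue
		}
		if i := ts % windowSize; c.stamps[i] == ts {
			total += c.counts[i]
		}
	}
	return total
}

func main() {
	var c secondCounter
	c.Add(1000, 3)
	c.Add(1000, 2)
	c.Add(1001, 4)
	fmt.Println(c.Sum(1001, 1)) // prints 4: events in second 1001 only
	fmt.Println(c.Sum(1001, 2)) // prints 9: seconds 1000 and 1001
}
```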