
Designing a Monitoring and Alerting System

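Question

How would you design an application monitoring and alerting system in Python, and how do you instrument the application with Prometheus metrics?

Answer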

Architecture

Prometheus Metric Instrumentation

monitoring/metrics.py
from prometheus_client import Counter, Histogram, Gauge, Info

# Request count
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"],
)

# Request latency (histogram)
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["method", "endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
)

# Active connections (current value)
ACTIVE_CONNECTIONS = Gauge(
    "active_connections",
    "Number of active connections",
)

# Application info
APP_INFO = Info("app", "Application information")
APP_INFO.info({"version": "1.2.0", "env": "production"})
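To see what these metric objects produce, the following minimal sketch records one request and renders the text format that Prometheus scrapes. It assumes the `prometheus_client` package is installed and uses a private `CollectorRegistry` (an assumption made here so the example does not touch the global default registry); the `/orders` endpoint and the shortened bucket list are illustrative only.

```python
from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

# A private registry keeps this sketch isolated from the global default one.
registry = CollectorRegistry()

requests_total = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"],
    registry=registry,
)
request_latency = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["method", "endpoint"],
    buckets=[0.1, 0.5, 1.0],  # shortened bucket list for readability
    registry=registry,
)

# Record one request to a hypothetical /orders endpoint that took 0.3 seconds.
requests_total.labels(method="GET", endpoint="/orders", status="200").inc()
request_latency.labels(method="GET", endpoint="/orders").observe(0.3)

# Render the text exposition format that a Prometheus server would scrape.
exposition = generate_latest(registry).decode("utf-8")
print(exposition)
```

Histogram buckets are cumulative, so the 0.3 second observation is counted in the `le="0.5"` and `le="1.0"` buckets (and in `+Inf`), but not in `le="0.1"`.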

FastAPI Middleware

monitoring/middleware.py
import time

from fastapi import FastAPI, Request
from prometheus_client import make_asgi_app
from starlette.middleware.base import BaseHTTPMiddleware

from monitoring.metrics import ACTIVE_CONNECTIONS, REQUEST_COUNT, REQUEST_LATENCY


class MetricsMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        method = request.method
        # Raw paths such as /users/42 create one time series per distinct URL;
        # consider normalizing them before using them as a label value.
        endpoint = request.url.path

        ACTIVE_CONNECTIONS.inc()  # effectively tracks in-flight requests
        start = time.perf_counter()

        try:
            response = await call_next(request)
            status = response.status_code
        except Exception:
            # Unhandled exceptions are counted as 500 before being re-raised.
            status = 500
            raise
        finally:
            duration = time.perf_counter() - start
            REQUEST_COUNT.labels(method, endpoint, status).inc()
            REQUEST_LATENCY.labels(method, endpoint).observe(duration)
            ACTIVE_CONNECTIONS.dec()

        return response


# Mount the /metrics endpoint
app = FastAPI()
app.add_middleware(MetricsMiddleware)
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)
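Because the middleware uses the raw URL path as the `endpoint` label, every distinct ID in a URL becomes its own time series. One possible mitigation, sketched below with the standard library only, is to collapse variable path segments before labelling. The function name `normalize_path` and the two regex rules are assumptions for illustration, not part of the original code.

```python
import re

# Hypothetical normalization rules: collapse UUIDs and numeric IDs into placeholders.
_UUID_SEGMENT = re.compile(
    r"/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}(?=/|$)"
)
_NUMERIC_SEGMENT = re.compile(r"/\d+(?=/|$)")


def normalize_path(path: str) -> str:
    """Map a raw URL path to a low-cardinality value suitable for a metric label."""
    path = _UUID_SEGMENT.sub("/{uuid}", path)
    return _NUMERIC_SEGMENT.sub("/{id}", path)


print(normalize_path("/users/42/orders/7"))
print(normalize_path("/items/123e4567-e89b-12d3-a456-426614174000"))
print(normalize_path("/v2/health"))
```

In the middleware this would be applied as `endpoint = normalize_path(request.url.path)`. Segments such as `v2` are left alone because only fully numeric or UUID-shaped segments match.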

Custom Business Metrics

monitoring/business.py
from prometheus_client import Counter, Histogram

# Business metrics
ORDER_CREATED = Counter("orders_created_total", "Total orders created", ["product_type"])
PAYMENT_AMOUNT = Histogram(
    "payment_amount_yuan",
    "Payment amount distribution",
    buckets=[10, 50, 100, 500, 1000, 5000],
)


def create_order(order):
    # Business logic...
    ORDER_CREATED.labels(product_type=order.type).inc()
    PAYMENT_AMOUNT.observe(order.amount)
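One way to check business instrumentation in a unit test is to read samples back from the registry. The sketch below assumes `prometheus_client` is installed, uses a private registry, and introduces a hypothetical `Order` dataclass with the `type` and `amount` attributes that `create_order` relies on.

```python
from dataclasses import dataclass

from prometheus_client import CollectorRegistry, Counter, Histogram

registry = CollectorRegistry()  # isolated registry for this sketch

orders_created = Counter(
    "orders_created_total", "Total orders created", ["product_type"], registry=registry
)
payment_amount = Histogram(
    "payment_amount_yuan",
    "Payment amount distribution",
    buckets=[10, 50, 100, 500, 1000, 5000],
    registry=registry,
)


@dataclass
class Order:  # hypothetical order model, for illustration only
    type: str
    amount: float


def create_order(order: Order) -> None:
    orders_created.labels(product_type=order.type).inc()
    payment_amount.observe(order.amount)


for order in [Order("book", 35.0), Order("book", 120.0), Order("phone", 4999.0)]:
    create_order(order)

# Read the recorded samples back, as a test might do.
book_orders = registry.get_sample_value("orders_created_total", {"product_type": "book"})
payment_count = registry.get_sample_value("payment_amount_yuan_count")
total_paid = registry.get_sample_value("payment_amount_yuan_sum")
print(book_orders, payment_count, total_paid)
```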

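Health Checks

monitoring/health.py
from fastapi.responses import JSONResponse
from sqlalchemy import text

# Reuses the `app` defined in monitoring/middleware.py.
from monitoring.middleware import app

# `db` (a SQLAlchemy session or connection) and `redis_client` (a redis.Redis
# instance) are assumed to be created elsewhere in the application.


@app.get("/health")
def health_check():
    # Declared with plain `def` so FastAPI runs these blocking calls in its thread pool.
    checks = {}
    # Database
    try:
        db.execute(text("SELECT 1"))
        checks["database"] = "ok"
    except Exception:
        checks["database"] = "error"

    # Redis
    try:
        redis_client.ping()
        checks["redis"] = "ok"
    except Exception:
        checks["redis"] = "error"

    all_ok = all(v == "ok" for v in checks.values())
    body = {"status": "healthy" if all_ok else "unhealthy", "checks": checks}
    # Return 503 when a dependency is down so probes and load balancers can react.
    return JSONResponse(body, status_code=200 if all_ok else 503)


@app.get("/ready")
async def readiness():
    """Readiness probe: whether the instance can accept traffic."""
    return {"status": "ready"}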
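HTTP probes (for example Kubernetes liveness and readiness probes) usually decide on the status code rather than the response body, so an unhealthy result is most useful when it maps to a non-2xx code. The sketch below separates that aggregation logic from the web framework so it can be tested on its own. `run_checks` and `failing_check` are hypothetical names, and a check is assumed to pass when it does not raise.

```python
from typing import Callable, Dict, Tuple


def run_checks(checks: Dict[str, Callable[[], None]]) -> Tuple[int, dict]:
    """Run each dependency check and return (HTTP status code, response body)."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception:
            results[name] = "error"
    healthy = all(value == "ok" for value in results.values())
    body = {"status": "healthy" if healthy else "unhealthy", "checks": results}
    return (200 if healthy else 503), body


def failing_check() -> None:
    # Stand-in for a dependency that cannot be reached.
    raise ConnectionError("dependency unreachable")


code, body = run_checks({"database": lambda: None, "redis": failing_check})
print(code, body)
```

A FastAPI handler could then wrap the result in a `JSONResponse` with the returned status code.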

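Prometheus Alerting Rules (notifications routed by Alertmanager)

prometheus/alert_rules.yml
groups:
  - name: python-app
    rules:
      - alert: HighErrorRate
        # Aggregate with sum() so 5xx traffic is divided by all traffic, not by itself.
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "5xx error rate above 5%"

      - alert: HighLatency
        # P95 is evaluated separately for each method/endpoint series.
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 2 seconds"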
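The quantity the HighErrorRate rule is meant to capture is the share of all requests that returned a 5xx status over the window. The following sketch computes it in plain Python from two snapshots of the counter taken 5 minutes apart. It is a simplified model of `rate()` (it ignores counter resets and extrapolation), and the sample values are made up for illustration.

```python
import re
from typing import Dict, Tuple

Labels = Tuple[str, str, str]  # (method, endpoint, status)


def per_second_rate(start: float, end: float, window_seconds: float) -> float:
    """Simplified rate(): counter increase divided by the window length."""
    return (end - start) / window_seconds


def error_ratio(
    start: Dict[Labels, float], end: Dict[Labels, float], window_seconds: float
) -> float:
    """Sum of 5xx rates divided by the sum of all rates, across every series."""
    total = 0.0
    errors = 0.0
    for labels, end_value in end.items():
        series_rate = per_second_rate(start.get(labels, 0.0), end_value, window_seconds)
        total += series_rate
        if re.fullmatch(r"5..", labels[2]):  # same matcher as status=~"5.."
            errors += series_rate
    return errors / total if total else 0.0


# Made-up counter snapshots taken 300 seconds apart.
start = {("GET", "/orders", "200"): 1000.0, ("GET", "/orders", "500"): 10.0}
end = {("GET", "/orders", "200"): 1270.0, ("GET", "/orders", "500"): 40.0}

ratio = error_ratio(start, end, 300.0)
print(ratio, ratio > 0.05)
```

Here 30 of 300 new requests failed, so the ratio is about 0.1 and the 5% threshold would be exceeded. Note that the error and total rates are summed across series before dividing; dividing series by series would compare each 5xx series only with itself.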

Common Interview Questions

Q1: What are the four metric types?

Answer

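| Type | Description | Typical use |
| --- | --- | --- |
| Counter | Only increases, never decreases | Request count, error count |
| Gauge | Can go up and down | Connection count, temperature |
| Histogram | Observations counted into buckets | Latency distribution, size distribution |
| Summary | Quantiles computed on the client | Latency P99 |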
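The practical difference between Histogram and Summary is where the quantile is computed: a Histogram only ships cumulative bucket counts, and the server estimates quantiles from them. The sketch below is a simplified pure-Python version of that estimation, using linear interpolation inside the bucket that contains the target rank. It is an approximation of what `histogram_quantile` does, not its exact implementation, and the bucket counts are invented.

```python
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile (0 < q <= 1) from cumulative (upper_bound, count) pairs.

    Buckets must be sorted by upper bound and end with float("inf").
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Invented cumulative counts: 50 requests <= 0.1s, 90 <= 0.5s, 98 <= 1.0s, 100 in total.
latency_buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 98.0), (float("inf"), 100.0)]
p95 = histogram_quantile(0.95, latency_buckets)
print(p95)
```

The estimate can only be as precise as the bucket boundaries, which is why bucket choice matters for latency alerts.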

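Q2: What are the three pillars of observability?

Answer

  1. Metrics: numeric measurements (Prometheus + Grafana)
  2. Logging: logs (ELK / Loki)
  3. Tracing: distributed tracing (Jaeger / Zipkin)

Used together: spot the anomalous metric on a dashboard → follow the distributed trace to locate the failing service → read the logs to find the root cause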
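Moving from a trace to the matching logs only works if log lines carry the trace ID. A minimal standard-library sketch of that idea is shown below. The `trace_id_var` context variable, the `TraceIdFilter` class, and the log format are assumptions for illustration; in a real service the ID would typically come from the tracing library rather than `uuid`.

```python
import contextvars
import io
import logging
import uuid

# Hypothetical context variable holding the current request's trace ID.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True


# Log into an in-memory stream so the output can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# Simulate handling one request: set the trace ID, then log an error.
trace_id = uuid.uuid4().hex
trace_id_var.set(trace_id)
logger.error("payment provider timed out")

log_line = stream.getvalue().strip()
print(log_line)
```

With the ID in every line, a log backend such as Loki or Elasticsearch can be filtered by the trace ID found in Jaeger or Zipkin.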

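Q3: What is the difference between SLI, SLO, and SLA?

Answer

  • SLI (Service Level Indicator): a measured quantity, such as availability or P99 latency
  • SLO (Service Level Objective): the target value for an SLI, e.g. availability > 99.9%
  • SLA (Service Level Agreement): an external commitment, with compensation if it is breached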
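An SLO translates directly into an error budget: the amount of failure the target still permits. The arithmetic can be sketched as follows; the function names, the 30-day window, and the request numbers are illustrative assumptions.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime in minutes that an availability SLO allows over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        raise ValueError("no error budget: the SLO is 100% or there were no requests")
    return 1.0 - failed_requests / allowed_failures


# A 99.9% availability SLO over 30 days allows roughly 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))

# 250 failures out of 1,000,000 requests uses a quarter of the 1,000 allowed failures.
print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

In these terms the SLI is the measured failure ratio, the SLO is the 99.9% target, and an SLA would attach contractual consequences to missing a (usually looser) target.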
