设计埋点监控系统

一、需求分析

1.1 功能需求

前端埋点监控系统是保障线上产品质量的关键基础设施。一套完整的系统需覆盖以下四大功能模块：

功能模块	核心能力	典型指标
错误监控	JS 错误、Promise 异常、资源加载错误、框架渲染错误、接口异常	错误数、错误率、影响用户数
性能监控	页面加载性能、Core Web Vitals、资源加载瀑布、长任务检测	LCP、INP、CLS、TTFB、FCP
用户行为埋点	PV/UV、点击热图、页面停留时长、用户路径追踪	DAU、转化率、跳出率、留存率
自定义事件	业务自定义指标上报、A/B 测试数据采集	自定义 KPI

1.2 非功能需求

关键设计约束

在面试中回答系统设计题时，明确非功能需求是拉开差距的关键。它反映了你的工程化思维深度。

非功能需求	设计目标	实现手段
低侵入	接入只需一行代码，对业务逻辑零侵入	插件化架构、自动劫持原生 API
不影响业务性能	SDK 不影响页面加载和交互流畅度	异步加载、requestIdleCallback、Web Worker
数据准确性	数据不丢失、不重复、时序正确	缓冲队列、sendBeacon、离线重试、去重指纹
可扩展	支持自定义插件、自定义上报维度	插件系统、Hook 机制、中间件链
安全合规	数据脱敏、隐私保护、GDPR 合规	beforeSend 钩子、匿名化处理、用户授权

二、整体架构

2.1 监控链路全景

2.2 数据流转时序

三、核心模块设计

3.1 SDK 核心架构 — 插件化设计

SDK 的核心理念是 "微内核 + 插件" 架构，核心只负责生命周期管理、数据缓冲和上报通道，所有采集能力由插件提供。

生命周期设计：

sdk/types.ts
/** 插件接口 */
interface MonitorPlugin {
  /** 插件名称，唯一标识 */
  name: string;
  /** 安装阶段：注册事件监听、劫持原生 API */
  setup(ctx: MonitorContext): void;
  /** 卸载阶段：清理监听器、恢复原生 API */
  destroy?(): void;
}

/** SDK 上下文，传递给每个插件 */
interface MonitorContext {
  /** 上报数据（写入缓冲队列） */
  report(data: ReportData): void;
  /** 获取全局配置 */
  getConfig(): MonitorConfig;
  /** 获取公共字段（appId、userId、sessionId 等） */
  getCommonFields(): Record<string, string>;
}

/** SDK 配置 */
interface MonitorConfig {
  /** 应用标识 */
  appId: string;
  /** 上报地址 */
  reportUrl: string;
  /** 全局采样率 0~1，默认 1 */
  sampleRate?: number;
  /** 缓冲区最大条目数，默认 10 */
  maxBufferSize?: number;
  /** 定时上报间隔（ms），默认 5000 */
  flushInterval?: number;
  /** 插件列表 */
  plugins?: MonitorPlugin[];
  /** 发送前钩子，可用于数据脱敏、过滤 */
  beforeSend?(data: ReportData): ReportData | null;
}

/** 上报数据结构 */
interface ReportData {
  /** 一级分类：error / perf / behavior / custom */
  type: string;
  /** 二级分类：js_error / lcp / page_view 等 */
  subType: string;
  /** 具体数据 */
  data: Record<string, unknown>;
  /** 客户端时间戳 */
  timestamp: number;
  /** 以下字段由 SDK 自动填充 */
  appId?: string;
  userId?: string;
  sessionId?: string;
  pageUrl?: string;
  userAgent?: string;
  traceId?: string;
}

插件化的好处

按需加载：不需要性能监控的页面可以不引入 PerfPlugin，减小 SDK 体积
解耦维护：各插件独立开发、测试、发布
开放扩展：业务方可以编写自定义插件，无需修改 SDK 源码
Tree Shaking 友好：未使用的插件在构建时自动被移除

3.2 错误监控模块

错误监控是整个系统的基石。需要覆盖 5 种主要错误类型：

错误类型	捕获方式	关键信息
JS 运行时错误	`window.onerror`	message、stack、source、lineno、colno
Promise 异常	`unhandledrejection`	reason、stack
资源加载错误	`addEventListener('error', ..., true)`	tagName、src/href
跨域 Script Error	`crossorigin` + CORS 头	需要基建配合
React ErrorBoundary	`getDerivedStateFromError` + `componentDidCatch`	componentStack

跨域 Script Error 的处理：

当 <script> 加载了不同域的 JS（如 CDN），window.onerror 只能拿到一个 "Script error." 字符串，没有任何有效信息。解决方案需要前后端配合：

解决跨域 Script Error
// 1. HTML 中给 <script> 标签添加 crossorigin
// <script src="https://cdn.example.com/app.js" crossorigin="anonymous"></script>

// 2. CDN 服务器响应头添加
// Access-Control-Allow-Origin: *

// 3. 如果无法修改 CDN 配置，可以用 try/catch 包裹法：
const originAddEventListener = EventTarget.prototype.addEventListener;

EventTarget.prototype.addEventListener = function (
  type: string,
  listener: EventListenerOrEventListenerObject,
  options?: boolean | AddEventListenerOptions
): void {
  const wrappedListener = function (this: EventTarget, ...args: unknown[]): unknown {
    try {
      return typeof listener === 'function'
        ? listener.apply(this, args as [Event])
        : listener.handleEvent.apply(listener, args as [Event]);
    } catch (error) {
      // 此时可以拿到完整的错误信息
      throw error;
    }
  };

  return originAddEventListener.call(
    this,
    type,
    wrappedListener as EventListener,
    options
  );
};

Script Error 面试高频

面试官经常会问"线上看到大量 Script error. 怎么处理？"。答出 crossorigin + CORS + try/catch 包裹三种方案即可。

React ErrorBoundary 的完整实现：

sdk/plugins/react-error-boundary.tsx
import React, { Component, type ErrorInfo, type ReactNode } from 'react';

interface Props {
  children: ReactNode;
  fallback?: ReactNode | ((error: Error) => ReactNode);
  onError?: (error: Error, errorInfo: ErrorInfo) => void;
}

interface State {
  hasError: boolean;
  error: Error | null;
}

class MonitorErrorBoundary extends Component<Props, State> {
  state: State = { hasError: false, error: null };

  static getDerivedStateFromError(error: Error): Partial<State> {
    return { hasError: true, error };
  }

  componentDidCatch(error: Error, errorInfo: ErrorInfo): void {
    // 上报到监控系统
    this.props.onError?.(error, errorInfo);

    // 如果有全局 SDK 实例，也可以直接调用
    window.__MONITOR__?.report({
      type: 'error',
      subType: 'react_error',
      data: {
        message: error.message,
        stack: error.stack ?? '',
        componentStack: errorInfo.componentStack ?? '',
      },
      timestamp: Date.now(),
    });
  }

  render(): ReactNode {
    if (this.state.hasError) {
      const { fallback } = this.props;
      if (typeof fallback === 'function') {
        return fallback(this.state.error!);
      }
      return fallback ?? <h2>页面出错了，请刷新重试</h2>;
    }
    return this.props.children;
  }
}

export default MonitorErrorBoundary;

3.3 性能监控模块

性能监控的目标是量化用户体验，核心围绕 Performance API 和 Web Vitals 两大体系。

Performance API 采集全链路耗时：

sdk/plugins/perf-plugin.ts
/** 采集 Navigation Timing 全链路耗时 */
function collectNavigationTiming(): Record<string, number> | null {
  const [entry] = performance.getEntriesByType(
    'navigation'
  ) as PerformanceNavigationTiming[];
  if (!entry) return null;

  return {
    // DNS 解析
    dns: entry.domainLookupEnd - entry.domainLookupStart,
    // TCP 建连
    tcp: entry.connectEnd - entry.connectStart,
    // SSL 握手
    ssl: entry.secureConnectionStart > 0
      ? entry.connectEnd - entry.secureConnectionStart : 0,
    // 首字节时间 TTFB
    ttfb: entry.responseStart - entry.requestStart,
    // 内容下载
    download: entry.responseEnd - entry.responseStart,
    // DOM 解析
    domParse: entry.domInteractive - entry.responseEnd,
    // DOM Ready
    domReady: entry.domContentLoadedEventEnd - entry.fetchStart,
    // 页面完全加载
    pageLoad: entry.loadEventEnd - entry.fetchStart,
  };
}

Web Vitals 核心指标采集：

sdk/plugins/web-vitals-collector.ts
import { onLCP, onINP, onCLS, onFCP, onTTFB } from 'web-vitals';
import type { Metric } from 'web-vitals';

function initWebVitals(report: (data: ReportData) => void): void {
  const handleMetric = (metric: Metric): void => {
    report({
      type: 'perf',
      subType: metric.name.toLowerCase(),
      data: {
        name: metric.name,
        value: metric.value,
        rating: metric.rating,    // 'good' | 'needs-improvement' | 'poor'
        delta: metric.delta,
        id: metric.id,
        navigationType: metric.navigationType,
      },
      timestamp: Date.now(),
    });
  };

  onLCP(handleMetric);   // 最大内容绘制
  onINP(handleMetric);   // 交互到下一帧绘制
  onCLS(handleMetric);   // 累积布局偏移
  onFCP(handleMetric);   // 首次内容绘制
  onTTFB(handleMetric);  // 首字节时间
}

Core Web Vitals 阈值参考（2024 年起 INP 替代 FID）：

指标	含义	Good	Needs Improvement	Poor
LCP	最大内容绘制	$\leq 2.5s$	$2.5s \sim 4s$	$> 4s$
INP	交互到下一帧绘制	$\leq 200ms$	$200ms \sim 500ms$	$> 500ms$
CLS	累积布局偏移	$\leq 0.1$	$0.1 \sim 0.25$	$> 0.25$

资源加载瀑布图数据采集：

sdk/plugins/resource-timing.ts
function collectResourceTiming(): void {
  const observer = new PerformanceObserver((list) => {
    for (const entry of list.getEntries() as PerformanceResourceTiming[]) {
      // 过滤掉上报请求本身，避免死循环
      if (entry.name.includes('/api/report')) continue;

      const resourceData = {
        name: entry.name,
        type: entry.initiatorType,  // script / css / img / fetch 等
        duration: entry.duration,
        transferSize: entry.transferSize,     // 传输大小
        decodedBodySize: entry.decodedBodySize, // 解码后大小
        dns: entry.domainLookupEnd - entry.domainLookupStart,
        tcp: entry.connectEnd - entry.connectStart,
        ttfb: entry.responseStart - entry.requestStart,
        download: entry.responseEnd - entry.responseStart,
      };

      // 只上报慢资源（> 1s）或大资源（> 500KB）
      if (entry.duration > 1000 || entry.transferSize > 500 * 1024) {
        report({ type: 'perf', subType: 'slow_resource', data: resourceData, timestamp: Date.now() });
      }
    }
  });

  observer.observe({ type: 'resource', buffered: true });
}

长任务检测：

sdk/plugins/long-task.ts
/** 长任务：阻塞主线程超过 50ms 的任务 */
function observeLongTasks(report: (data: ReportData) => void): void {
  if (!('PerformanceObserver' in window)) return;

  const observer = new PerformanceObserver((list) => {
    for (const entry of list.getEntries()) {
      report({
        type: 'perf',
        subType: 'long_task',
        data: {
          duration: entry.duration,
          startTime: entry.startTime,
          name: entry.name,
          // 标记当前页面 URL，便于定位问题页面
          pageUrl: window.location.href,
        },
        timestamp: Date.now(),
      });
    }
  });

  observer.observe({ type: 'longtask', buffered: true });
}

3.4 页面体验指标

除了标准 Web Vitals，业务中还需要一系列自定义体验指标来反映真实用户感受。

页面秒开率

秒开率是业务最关心的综合体验指标，统计口径为页面加载时间 ≤ N 秒的占比。

sdk/plugins/metrics/second-open.ts
interface SecondOpenMetric {
  /** 页面加载耗时（ms） */
  loadTime: number;
  /** 是否秒开（≤ 阈值） */
  isSecondOpen: boolean;
  /** 网络类型 */
  networkType: string;
}

function collectSecondOpen(
  report: (data: ReportData) => void,
  threshold = 1000 // 默认 1 秒
): void {
  const onLoad = (): void => {
    setTimeout(() => {
      const [nav] = performance.getEntriesByType(
        'navigation'
      ) as PerformanceNavigationTiming[];
      if (!nav) return;

      // 统计口径：fetchStart → loadEventEnd
      const loadTime = nav.loadEventEnd - nav.fetchStart;

      report({
        type: 'perf',
        subType: 'second_open',
        data: {
          loadTime,
          isSecondOpen: loadTime <= threshold,
          networkType: (navigator as any).connection?.effectiveType ?? 'unknown',
        },
        timestamp: Date.now(),
      });
    }, 0);
  };

  if (document.readyState === 'complete') onLoad();
  else window.addEventListener('load', onLoad);
}

秒开率的统计维度

秒开率需要分维度看才有意义：

按网络类型：4G / Wi-Fi / 3G 分别统计，3G 秒开率低是正常的
按页面类型：首页、详情页、列表页秒开阈值可能不同
按地域：不同地区 CDN 覆盖不同
按缓存状态：首次访问 vs 二次访问

常用阈值：1s（高要求）、2s（一般）、3s（低网速兜底）

首屏时间

首屏时间 ≠ LCP。LCP 只看最大内容元素，而首屏时间关注的是可视区域内所有关键内容加载完成的时间，包括图片。

sdk/plugins/metrics/first-screen.ts
/**
 * 首屏时间采集
 * 核心思路：MutationObserver 监听 DOM 变化，
 * 找到首屏区域内最后一个元素渲染完成的时间点
 */
function collectFirstScreenTime(report: (data: ReportData) => void): void {
  const viewportHeight = window.innerHeight;
  const viewportWidth = window.innerWidth;
  let lastMutationTime = performance.now();
  let rafId: number;
  let observerTimer: ReturnType<typeof setTimeout>;

  // 判断元素是否在首屏可视区域内
  function isInFirstScreen(el: Element): boolean {
    const rect = el.getBoundingClientRect();
    return (
      rect.top < viewportHeight &&
      rect.left < viewportWidth &&
      rect.bottom > 0 &&
      rect.right > 0 &&
      rect.width > 0 &&
      rect.height > 0
    );
  }

  // 获取首屏内所有图片的加载完成时间
  function getFirstScreenImageLoadTime(): Promise<number> {
    const images = Array.from(document.querySelectorAll('img')).filter(isInFirstScreen);

    if (images.length === 0) return Promise.resolve(0);

    return Promise.all(
      images.map(
        (img) =>
          new Promise<number>((resolve) => {
            if (img.complete) {
              // 已加载完成，用 Resource Timing 获取精确时间
              const entry = performance
                .getEntriesByName(img.src)
                .pop() as PerformanceResourceTiming | undefined;
              resolve(entry?.responseEnd ?? 0);
            } else {
              img.addEventListener('load', () => {
                const entry = performance
                  .getEntriesByName(img.src)
                  .pop() as PerformanceResourceTiming | undefined;
                resolve(entry?.responseEnd ?? performance.now());
              });
              img.addEventListener('error', () => resolve(0));
            }
          })
      )
    ).then((times) => Math.max(...times, 0));
  }

  // MutationObserver 监听 DOM 变化
  const observer = new MutationObserver(() => {
    lastMutationTime = performance.now();

    // 每次 DOM 变化后重置定时器
    clearTimeout(observerTimer);

    // 2 秒内无新 DOM 变化，认为首屏 DOM 渲染完成
    observerTimer = setTimeout(async () => {
      observer.disconnect();

      const domReadyTime = lastMutationTime;
      const imageLoadTime = await getFirstScreenImageLoadTime();

      // 首屏时间 = max(最后一次 DOM 变化时间, 首屏图片加载完成时间)
      const firstScreenTime = Math.max(domReadyTime, imageLoadTime);

      report({
        type: 'perf',
        subType: 'first_screen',
        data: {
          firstScreenTime: Math.round(firstScreenTime),
          domReadyTime: Math.round(domReadyTime),
          imageLoadTime: Math.round(imageLoadTime),
          imageCount: document.querySelectorAll('img').length,
        },
        timestamp: Date.now(),
      });
    }, 2000);
  });

  observer.observe(document.body, {
    childList: true,
    subtree: true,
  });

  // 兜底：最多等 10 秒
  setTimeout(() => observer.disconnect(), 10_000);
}

白屏检测

白屏不一定有 JS 报错（可能是空数据、CSS 加载失败、接口超时），需要主动检测页面是否有内容渲染。

sdk/plugins/metrics/white-screen.ts
/**
 * 白屏检测方案：采样点检测法
 * 在页面上均匀取若干采样点，检查这些点对应的元素是否为有效内容
 */
function detectWhiteScreen(report: (data: ReportData) => void): void {
  // 页面加载完成后延迟检测
  const check = (): void => {
    setTimeout(() => {
      // 在视口内均匀取 18 个采样点（横 6 × 纵 3）
      const samplePoints: Array<{ x: number; y: number }> = [];
      const cols = 6, rows = 3;
      const w = window.innerWidth, h = window.innerHeight;

      for (let i = 0; i < rows; i++) {
        for (let j = 0; j < cols; j++) {
          samplePoints.push({
            x: (w / (cols + 1)) * (j + 1),
            y: (h / (rows + 1)) * (i + 1),
          });
        }
      }

      // 无效元素：html, body, head 等"空"节点
      const wrapperTags = new Set(['html', 'body', 'head', 'meta', 'script', 'style', 'link']);
      let emptyCount = 0;

      for (const { x, y } of samplePoints) {
        const el = document.elementFromPoint(x, y);
        if (!el || wrapperTags.has(el.tagName.toLowerCase())) {
          emptyCount++;
        }
      }

      // 空点占比 > 90% 判定为白屏
      const isWhiteScreen = emptyCount / samplePoints.length > 0.9;

      if (isWhiteScreen) {
        report({
          type: 'perf',
          subType: 'white_screen',
          data: {
            emptyCount,
            totalPoints: samplePoints.length,
            emptyRate: emptyCount / samplePoints.length,
            url: window.location.href,
            // 记录 html 和 body 的子元素数量，辅助排查
            htmlChildren: document.documentElement.children.length,
            bodyChildren: document.body?.children.length ?? 0,
          },
          timestamp: Date.now(),
        });
      }
    }, 3000); // 延迟 3 秒检测，给页面渲染留足时间
  };

  if (document.readyState === 'complete') check();
  else window.addEventListener('load', check);
}

页面崩溃检测

JS 崩溃（页面完全无响应）时无法自我上报，需要外部心跳机制检测。

sdk/plugins/metrics/crash-detect.ts
/**
 * 方案 1：Service Worker 心跳检测
 * 页面定时向 SW 发心跳，SW 检测到心跳中断则判定崩溃
 */

// --- 页面端 ---
function startHeartbeat(): void {
  const HEARTBEAT_INTERVAL = 5000; // 5 秒一次心跳

  // 页面加载时标记为 "活着"
  sessionStorage.setItem('page_alive', '1');

  const timer = setInterval(() => {
    navigator.serviceWorker?.controller?.postMessage({
      type: 'HEARTBEAT',
      url: window.location.href,
      timestamp: Date.now(),
    });
  }, HEARTBEAT_INTERVAL);

  // 正常关闭时清理标记
  window.addEventListener('beforeunload', () => {
    clearInterval(timer);
    sessionStorage.removeItem('page_alive');
    navigator.serviceWorker?.controller?.postMessage({
      type: 'PAGE_UNLOAD',
      url: window.location.href,
    });
  });
}

// --- Service Worker 端 ---
const pageHeartbeats = new Map<string, number>(); // clientId -> lastHeartbeat
const HEARTBEAT_TIMEOUT = 15_000; // 15 秒无心跳判定崩溃

self.addEventListener('message', (event: ExtendableMessageEvent) => {
  const clientId = (event.source as Client)?.id;
  if (!clientId) return;

  if (event.data.type === 'HEARTBEAT') {
    pageHeartbeats.set(clientId, Date.now());
  }

  if (event.data.type === 'PAGE_UNLOAD') {
    pageHeartbeats.delete(clientId);
  }
});

// 定时检查心跳超时
setInterval(() => {
  const now = Date.now();
  pageHeartbeats.forEach((lastTime, clientId) => {
    if (now - lastTime > HEARTBEAT_TIMEOUT) {
      // 判定为崩溃，通过 fetch 上报
      fetch('/api/report', {
        method: 'POST',
        body: JSON.stringify({
          type: 'error',
          subType: 'page_crash',
          data: { clientId, lastHeartbeat: lastTime },
          timestamp: now,
        }),
      });
      pageHeartbeats.delete(clientId);
    }
  });
}, 10_000);

sdk/plugins/metrics/crash-detect-simple.ts
/**
 * 方案 2：localStorage 标记法（不依赖 Service Worker）
 * 页面加载时标记 "loading"，加载完改为 "loaded"，
 * 下次进入时检查上次是否正常结束
 */
function detectCrashByStorage(report: (data: ReportData) => void): void {
  const KEY = 'page_lifecycle_state';
  const prevState = localStorage.getItem(KEY);

  // 上次状态不是 "closed"，说明页面异常退出（崩溃或被杀）
  if (prevState && prevState !== 'closed') {
    report({
      type: 'error',
      subType: 'page_crash',
      data: {
        prevState,
        prevUrl: localStorage.getItem('page_lifecycle_url') ?? '',
      },
      timestamp: Date.now(),
    });
  }

  // 标记当前页面状态
  localStorage.setItem(KEY, 'active');
  localStorage.setItem('page_lifecycle_url', window.location.href);

  window.addEventListener('beforeunload', () => {
    localStorage.setItem(KEY, 'closed');
  });
}

两种崩溃检测方案对比

方案	优点	缺点
SW 心跳	实时检测，可以估算崩溃时间	依赖 Service Worker，iOS Safari 限制多
localStorage 标记	简单、兼容性好	只能在下次访问时检测到，无法实时告警

建议两种方案结合使用：有 SW 的场景用心跳，没有的用标记法兜底。

FMP（首次有意义绘制）

FMP 不是 Web 标准指标，但比 FCP 更贴近业务体感。思路是找到DOM 变化权重最大的时间点。

sdk/plugins/metrics/fmp.ts
/**
 * FMP 近似计算
 * 监听 DOM 变化，计算每次变化增加的 DOM "权重"（元素面积），
 * 权重增长最大的时刻即为 FMP
 */
function collectFMP(report: (data: ReportData) => void): void {
  let maxWeight = 0;
  let fmpTime = 0;

  const observer = new MutationObserver(() => {
    // 计算当前可视区域内所有元素的面积总和
    const weight = calculateVisibleWeight();

    if (weight > maxWeight) {
      maxWeight = weight;
      fmpTime = performance.now();
    }
  });

  observer.observe(document.body, { childList: true, subtree: true });

  // 5 秒后结束观测并上报
  setTimeout(() => {
    observer.disconnect();
    if (fmpTime > 0) {
      report({
        type: 'perf',
        subType: 'fmp',
        data: { fmpTime: Math.round(fmpTime) },
        timestamp: Date.now(),
      });
    }
  }, 5000);
}

function calculateVisibleWeight(): number {
  const viewportH = window.innerHeight;
  let totalArea = 0;

  document.body.querySelectorAll('*').forEach((el) => {
    const rect = (el as HTMLElement).getBoundingClientRect();
    if (rect.top < viewportH && rect.bottom > 0 && rect.width > 0 && rect.height > 0) {
      totalArea += rect.width * Math.min(rect.height, viewportH - rect.top);
    }
  });

  return totalArea;
}

TTI（可交互时间）

TTI 表示页面从加载到真正可以流畅交互的时间。Google 已废弃 TTI 转向 INP，但面试仍常问。

sdk/plugins/metrics/tti.ts
/**
 * TTI 近似计算
 * 定义：FCP 之后，连续 5 秒没有长任务（> 50ms）且网络空闲
 */
function collectTTI(report: (data: ReportData) => void): void {
  let fcpTime = 0;
  let lastLongTaskEnd = 0;

  // 1. 获取 FCP 时间
  const fcpObserver = new PerformanceObserver((list) => {
    for (const entry of list.getEntries()) {
      if (entry.name === 'first-contentful-paint') {
        fcpTime = entry.startTime;
      }
    }
  });
  fcpObserver.observe({ type: 'paint', buffered: true });

  // 2. 追踪长任务
  const ltObserver = new PerformanceObserver((list) => {
    for (const entry of list.getEntries()) {
      lastLongTaskEnd = entry.startTime + entry.duration;
    }
  });
  ltObserver.observe({ type: 'longtask', buffered: true });

  // 3. 5 秒静默窗口检测
  const checkTTI = (): void => {
    const now = performance.now();
    const quietWindowStart = Math.max(fcpTime, lastLongTaskEnd);

    // FCP 之后，距离最后一个长任务结束超过 5 秒
    if (fcpTime > 0 && now - quietWindowStart >= 5000) {
      const tti = quietWindowStart;
      fcpObserver.disconnect();
      ltObserver.disconnect();

      report({
        type: 'perf',
        subType: 'tti',
        data: { tti: Math.round(tti) },
        timestamp: Date.now(),
      });
    } else if (now < 30_000) {
      // 最多观测 30 秒
      setTimeout(checkTTI, 1000);
    }
  };

  setTimeout(checkTTI, 5000);
}

3.5 视频/直播监控指标

视频和直播场景有独特的体验指标体系，是音视频团队面试的重点。

视频秒开 & 首帧时间

sdk/plugins/video/video-first-frame.ts
interface VideoStartMetric {
  /** 从点击播放到首帧渲染的时间（ms） */
  firstFrameTime: number;
  /** 是否秒开 */
  isSecondOpen: boolean;
  /** 拉流耗时：请求 M3U8 到收到第一个 TS 分片 */
  pullStreamTime: number;
  /** 解码耗时：首个分片到首帧渲染 */
  decodeTime: number;
  /** 播放器类型 */
  playerType: string;
}

function collectVideoFirstFrame(
  video: HTMLVideoElement,
  report: (data: ReportData) => void
): void {
  let playClickTime = 0;
  let pullStreamStart = 0;
  let pullStreamEnd = 0;

  // 统计口径：play() 调用 → 首帧渲染
  video.addEventListener('play', () => {
    playClickTime = performance.now();
    pullStreamStart = playClickTime;
  });

  // loadeddata: 第一帧数据就绪（拉流完成）
  video.addEventListener('loadeddata', () => {
    pullStreamEnd = performance.now();
  });

  // 方案 1: requestVideoFrameCallback（精确到帧）
  if ('requestVideoFrameCallback' in video) {
    video.addEventListener('play', () => {
      (video as any).requestVideoFrameCallback(
        (_now: number, metadata: { presentedFrames: number }) => {
          if (metadata.presentedFrames <= 1) {
            reportFirstFrame(performance.now());
          }
        }
      );
    });
  } else {
    // 方案 2: timeupdate 兜底
    const onTimeUpdate = (): void => {
      if (video.currentTime > 0) {
        reportFirstFrame(performance.now());
        video.removeEventListener('timeupdate', onTimeUpdate);
      }
    };
    video.addEventListener('timeupdate', onTimeUpdate);
  }

  function reportFirstFrame(renderTime: number): void {
    const firstFrameTime = renderTime - playClickTime;
    report({
      type: 'perf',
      subType: 'video_first_frame',
      data: {
        firstFrameTime: Math.round(firstFrameTime),
        isSecondOpen: firstFrameTime <= 1000,
        pullStreamTime: Math.round(pullStreamEnd - pullStreamStart),
        decodeTime: Math.round(renderTime - pullStreamEnd),
        playerType: detectPlayerType(video),
      },
      timestamp: Date.now(),
    });
  }
}

function detectPlayerType(video: HTMLVideoElement): string {
  const src = video.src || video.currentSrc || '';
  if (src.includes('.m3u8')) return 'HLS';
  if (src.includes('.mpd')) return 'DASH';
  if (src.includes('.flv')) return 'FLV';
  return 'Native';
}

视频秒开优化思路

首帧时间 = 拉流耗时 + 解码耗时 + 渲染耗时，对应的优化方向：

拉流：预加载 M3U8、CDN 预热、使用 HTTP/2
解码：选择合适的编码格式、降低首个 GOP 的大小
渲染：确保视频元素尺寸预设、避免首帧黑屏（先展示封面图）

卡顿检测

sdk/plugins/video/video-stall.ts
interface StallMetric {
  /** 卡顿次数 */
  stallCount: number;
  /** 总卡顿时长（ms） */
  stallDuration: number;
  /** 总观看时长（ms） */
  watchDuration: number;
  /** 卡顿率 = stallDuration / watchDuration */
  stallRate: number;
  /** 每次卡顿的详细信息 */
  stallDetails: Array<{ startTime: number; duration: number }>;
}

function collectVideoStall(
  video: HTMLVideoElement,
  report: (data: ReportData) => void
): void {
  let stallCount = 0;
  let totalStallDuration = 0;
  let stallStartTime = 0;
  let watchStartTime = 0;
  const stallDetails: Array<{ startTime: number; duration: number }> = [];

  // waiting 事件：播放器因数据不足暂停（缓冲卡顿）
  video.addEventListener('waiting', () => {
    stallStartTime = performance.now();
    stallCount++;
  });

  // playing 事件：从 waiting 恢复播放
  video.addEventListener('playing', () => {
    if (stallStartTime > 0) {
      const duration = performance.now() - stallStartTime;
      totalStallDuration += duration;
      stallDetails.push({
        startTime: stallStartTime,
        duration: Math.round(duration),
      });
      stallStartTime = 0;
    }
  });

  video.addEventListener('play', () => {
    watchStartTime = performance.now();
  });

  // 播放结束或页面离开时上报
  const reportStall = (): void => {
    if (watchStartTime === 0) return;
    const watchDuration = performance.now() - watchStartTime;

    report({
      type: 'perf',
      subType: 'video_stall',
      data: {
        stallCount,
        stallDuration: Math.round(totalStallDuration),
        watchDuration: Math.round(watchDuration),
        stallRate: watchDuration > 0
          ? Number((totalStallDuration / watchDuration).toFixed(4))
          : 0,
        stallDetails,
      },
      timestamp: Date.now(),
    });
  };

  video.addEventListener('ended', reportStall);
  video.addEventListener('pause', reportStall);
  document.addEventListener('visibilitychange', () => {
    if (document.visibilityState === 'hidden') reportStall();
  });
}

黑屏检测

视频有画面输出但画面全黑，通常是编码异常、解密失败或 CDN 源流问题。

sdk/plugins/video/video-black-screen.ts
/**
 * 黑屏检测：通过 Canvas 截取视频帧像素，统计亮度判断是否黑屏
 */
function detectBlackScreen(
  video: HTMLVideoElement,
  report: (data: ReportData) => void,
  options = { checkInterval: 5000, brightnessThreshold: 10, sampleSize: 64 }
): () => void {
  const canvas = document.createElement('canvas');
  const ctx = canvas.getContext('2d')!;
  canvas.width = options.sampleSize;
  canvas.height = options.sampleSize;

  let consecutiveBlackFrames = 0;

  const timer = setInterval(() => {
    // 只在播放中检测
    if (video.paused || video.ended || video.readyState < 2) return;

    // 将当前帧绘制到缩小的 Canvas 上
    ctx.drawImage(video, 0, 0, options.sampleSize, options.sampleSize);
    const imageData = ctx.getImageData(
      0, 0, options.sampleSize, options.sampleSize
    );
    const pixels = imageData.data;

    // 计算平均亮度（只取 R 通道的近似值）
    let totalBrightness = 0;
    const pixelCount = pixels.length / 4;

    for (let i = 0; i < pixels.length; i += 4) {
      // 亮度公式简化：0.299R + 0.587G + 0.114B
      totalBrightness +=
        pixels[i] * 0.299 + pixels[i + 1] * 0.587 + pixels[i + 2] * 0.114;
    }

    const avgBrightness = totalBrightness / pixelCount;

    if (avgBrightness < options.brightnessThreshold) {
      consecutiveBlackFrames++;

      // 连续 3 次检测为黑屏才上报（避免场景切换的误报）
      if (consecutiveBlackFrames >= 3) {
        report({
          type: 'error',
          subType: 'video_black_screen',
          data: {
            avgBrightness: Number(avgBrightness.toFixed(2)),
            currentTime: video.currentTime,
            videoSrc: video.currentSrc,
            consecutiveCount: consecutiveBlackFrames,
          },
          timestamp: Date.now(),
        });
        consecutiveBlackFrames = 0; // 上报后重置
      }
    } else {
      consecutiveBlackFrames = 0;
    }
  }, options.checkInterval);

  // 返回清理函数
  return () => clearInterval(timer);
}

音画同步检测

sdk/plugins/video/av-sync.ts
/**
 * 音画同步检测
 * 利用 requestVideoFrameCallback 获取视频帧的精确时间，
 * 与 AudioContext 的时间对比
 */
function detectAVSync(
  video: HTMLVideoElement,
  report: (data: ReportData) => void
): void {
  if (!('requestVideoFrameCallback' in video)) return;

  const checkSync = (
    _now: DOMHighResTimeStamp,
    metadata: { mediaTime: number; expectedDisplayTime: number }
  ): void => {
    // 音频时间（video.currentTime 反映音频轨道的时间）
    const audioTime = video.currentTime;
    // 视频帧的媒体时间
    const videoTime = metadata.mediaTime;

    const drift = Math.abs(audioTime - videoTime) * 1000; // 转为 ms

    // 超过 200ms 用户可感知不同步
    if (drift > 200) {
      report({
        type: 'perf',
        subType: 'av_sync_drift',
        data: {
          drift: Math.round(drift),
          audioTime,
          videoTime,
          currentTime: video.currentTime,
        },
        timestamp: Date.now(),
      });
    }

    // 持续监测
    (video as any).requestVideoFrameCallback(checkSync);
  };

  (video as any).requestVideoFrameCallback(checkSync);
}

码率切换与播放失败

sdk/plugins/video/video-quality.ts
/**
 * HLS 码率切换监控（以 hls.js 为例）
 */
function collectBitrateSwitch(
  hls: any, // Hls instance
  report: (data: ReportData) => void
): void {
  let switchCount = 0;
  let lastLevel = -1;

  hls.on('hlsLevelSwitched', (_event: string, data: { level: number }) => {
    if (lastLevel !== -1 && lastLevel !== data.level) {
      switchCount++;
      const levelDetail = hls.levels[data.level];

      report({
        type: 'perf',
        subType: 'bitrate_switch',
        data: {
          fromLevel: lastLevel,
          toLevel: data.level,
          bitrate: levelDetail?.bitrate ?? 0,
          resolution: `${levelDetail?.width ?? 0}x${levelDetail?.height ?? 0}`,
          switchCount,
        },
        timestamp: Date.now(),
      });
    }
    lastLevel = data.level;
  });
}

/** 播放失败监控 */
function collectPlayError(
  video: HTMLVideoElement,
  report: (data: ReportData) => void
): void {
  video.addEventListener('error', () => {
    const error = video.error;
    if (!error) return;

    // MediaError.code: 1=ABORTED, 2=NETWORK, 3=DECODE, 4=SRC_NOT_SUPPORTED
    const errorTypeMap: Record<number, string> = {
      1: 'MEDIA_ERR_ABORTED',
      2: 'MEDIA_ERR_NETWORK',
      3: 'MEDIA_ERR_DECODE',
      4: 'MEDIA_ERR_SRC_NOT_SUPPORTED',
    };

    report({
      type: 'error',
      subType: 'video_play_error',
      data: {
        code: error.code,
        type: errorTypeMap[error.code] ?? 'UNKNOWN',
        message: error.message,
        src: video.currentSrc,
        currentTime: video.currentTime,
        networkState: video.networkState,
        readyState: video.readyState,
      },
      timestamp: Date.now(),
    });
  });
}

视频监控指标汇总

指标	统计口径	阈值参考
首帧时间	`play` → `requestVideoFrameCallback` 首次回调	≤ 1s（秒开）
视频秒开率	首帧 ≤ 1s 的播放次数 / 总播放次数	≥ 90%
卡顿率	总 `waiting` 时长 / 总观看时长	≤ 1%
卡顿次数	`waiting` 事件触发次数 / 分钟	≤ 0.5 次/分钟
黑屏	Canvas 截帧平均亮度 < 10，连续 3 次	0 次
音画不同步	音频与视频帧时间差	≤ 200ms
码率切换频率	切换次数 / 分钟	≤ 2 次/分钟
播放失败率	`error` 事件 / 总播放次数	≤ 0.5%

3.6 补充通用指标

接口成功率与耗时分布

sdk/plugins/metrics/api-metrics.ts
interface APIMetric {
  url: string;
  method: string;
  status: number;
  duration: number;
  isSuccess: boolean;
  /** 响应体大小（bytes） */
  responseSize: number;
}

/** 按接口维度统计成功率和耗时分位数 */
function collectAPIMetrics(report: (data: ReportData) => void): void {
  const originalFetch = window.fetch;

  window.fetch = async (
    input: RequestInfo | URL,
    init?: RequestInit
  ): Promise<Response> => {
    const startTime = performance.now();
    const url = typeof input === 'string'
      ? input
      : input instanceof URL ? input.href : input.url;
    const method = init?.method ?? 'GET';

    try {
      const response = await originalFetch(input, init);
      const duration = performance.now() - startTime;

      report({
        type: 'perf',
        subType: 'api_metric',
        data: {
          url: new URL(url, window.location.origin).pathname, // 只取 pathname，去掉查询参数
          method: method.toUpperCase(),
          status: response.status,
          duration: Math.round(duration),
          isSuccess: response.ok,
          responseSize: Number(response.headers.get('content-length') ?? 0),
        },
        timestamp: Date.now(),
      });

      return response;
    } catch (error) {
      const duration = performance.now() - startTime;
      report({
        type: 'perf',
        subType: 'api_metric',
        data: {
          url: new URL(url, window.location.origin).pathname,
          method: method.toUpperCase(),
          status: 0, // 网络错误
          duration: Math.round(duration),
          isSuccess: false,
          responseSize: 0,
        },
        timestamp: Date.now(),
      });
      throw error;
    }
  };
}

SPA 路由切换耗时

sdk/plugins/metrics/spa-navigation.ts
/**
 * SPA 路由切换耗时
 * 统计口径：pushState/popstate → 下一次 LCP（或 DOM 稳定）
 */
function collectSPANavigation(report: (data: ReportData) => void): void {
  let routeChangeTime = 0;

  const onRouteChange = (): void => {
    routeChangeTime = performance.now();

    // 监听 DOM 变化稳定
    let lastMutationTime = routeChangeTime;
    const observer = new MutationObserver(() => {
      lastMutationTime = performance.now();
    });

    observer.observe(document.body, { childList: true, subtree: true });

    // 1 秒内无新 DOM 变化，认为路由页面渲染完成
    const check = (): void => {
      if (performance.now() - lastMutationTime >= 1000) {
        observer.disconnect();
        const duration = lastMutationTime - routeChangeTime;

        report({
          type: 'perf',
          subType: 'spa_navigation',
          data: {
            from: document.referrer,
            to: window.location.pathname,
            duration: Math.round(duration),
          },
          timestamp: Date.now(),
        });
      } else {
        setTimeout(check, 500);
      }
    };

    setTimeout(check, 1000);
  };

  // 劫持 pushState
  const originPush = history.pushState;
  history.pushState = function (...args) {
    originPush.apply(this, args);
    onRouteChange();
  };

  window.addEventListener('popstate', onRouteChange);
}

帧率（FPS）监控

sdk/plugins/metrics/fps-monitor.ts
/**
 * FPS 监控
 * 利用 requestAnimationFrame 计算每秒回调次数
 * 连续低于 30fps 超过 3 秒视为卡顿
 */
function collectFPS(report: (data: ReportData) => void): void {
  let frameCount = 0;
  let lastTime = performance.now();
  let lowFpsStart = 0;

  const loop = (): void => {
    frameCount++;
    const now = performance.now();

    if (now - lastTime >= 1000) {
      const fps = Math.round((frameCount * 1000) / (now - lastTime));
      frameCount = 0;
      lastTime = now;

      if (fps < 30) {
        if (lowFpsStart === 0) lowFpsStart = now;

        // 连续低帧率超过 3 秒上报
        if (now - lowFpsStart >= 3000) {
          report({
            type: 'perf',
            subType: 'low_fps',
            data: {
              fps,
              duration: Math.round(now - lowFpsStart),
              url: window.location.href,
            },
            timestamp: Date.now(),
          });
          lowFpsStart = now; // 重置，避免持续上报
        }
      } else {
        lowFpsStart = 0;
      }
    }
    requestAnimationFrame(loop);
  };

  requestAnimationFrame(loop);
}

内存占用监控

sdk/plugins/metrics/memory-monitor.ts
/**
 * 内存占用监控（仅 Chrome）
 * 定期采样 performance.memory，检测内存泄漏趋势
 */
function collectMemoryUsage(report: (data: ReportData) => void): void {
  // @ts-expect-error performance.memory 只在 Chrome 中存在
  if (!performance.memory) return;

  const samples: Array<{ time: number; used: number }> = [];

  setInterval(() => {
    // @ts-expect-error
    const { usedJSHeapSize, totalJSHeapSize, jsHeapSizeLimit } = performance.memory;

    samples.push({
      time: Date.now(),
      used: usedJSHeapSize,
    });

    // 保留最近 20 个采样点
    if (samples.length > 20) samples.shift();

    // 检测内存持续增长趋势（简单线性回归斜率 > 0）
    if (samples.length >= 10) {
      const isLeaking = detectLeakTrend(samples);

      if (isLeaking) {
        report({
          type: 'perf',
          subType: 'memory_warning',
          data: {
            usedJSHeapSize,
            totalJSHeapSize,
            jsHeapSizeLimit,
            usageRate: Number((usedJSHeapSize / jsHeapSizeLimit).toFixed(4)),
            trend: 'increasing',
          },
          timestamp: Date.now(),
        });
      }
    }
  }, 30_000); // 每 30 秒采样一次
}

function detectLeakTrend(samples: Array<{ time: number; used: number }>): boolean {
  // 简单策略：如果最近 10 个采样点中，后 5 个的平均值比前 5 个大 20%
  const mid = Math.floor(samples.length / 2);
  const firstHalf = samples.slice(0, mid);
  const secondHalf = samples.slice(mid);

  const avgFirst = firstHalf.reduce((sum, s) => sum + s.used, 0) / firstHalf.length;
  const avgSecond = secondHalf.reduce((sum, s) => sum + s.used, 0) / secondHalf.length;

  return avgSecond > avgFirst * 1.2;
}

3.7 用户行为追踪模块

PV/UV 统计：

sdk/plugins/behavior/pv-uv.ts
function initPVTracker(report: (data: ReportData) => void): void {
  // 1. 页面加载时上报 PV
  report({
    type: 'behavior',
    subType: 'page_view',
    data: {
      title: document.title,
      path: window.location.pathname,
      referrer: document.referrer,
      screenResolution: `${screen.width}x${screen.height}`,
    },
    timestamp: Date.now(),
  });

  // 2. SPA 路由变化时上报 PV
  // 监听 pushState
  const originPushState = history.pushState;
  history.pushState = function (...args) {
    originPushState.apply(this, args);
    onRouteChange();
  };

  // 监听 popstate（浏览器前进/后退）
  window.addEventListener('popstate', onRouteChange);

  function onRouteChange(): void {
    report({
      type: 'behavior',
      subType: 'page_view',
      data: {
        title: document.title,
        path: window.location.pathname,
        referrer: document.referrer,
      },
      timestamp: Date.now(),
    });
  }
}

页面停留时长：

sdk/plugins/behavior/page-stay.ts
function initPageStayTracker(report: (data: ReportData) => void): void {
  let enterTime = Date.now();
  let currentPath = window.location.pathname;

  // 方式 1：页面可见性变化（最可靠）
  document.addEventListener('visibilitychange', () => {
    if (document.visibilityState === 'hidden') {
      const duration = Date.now() - enterTime;
      report({
        type: 'behavior',
        subType: 'page_stay',
        data: { path: currentPath, duration },
        timestamp: Date.now(),
      });
    } else {
      // 页面重新可见，重置计时
      enterTime = Date.now();
    }
  });

  // 方式 2：SPA 路由切换
  const originPushState = history.pushState;
  history.pushState = function (...args) {
    const duration = Date.now() - enterTime;
    report({
      type: 'behavior',
      subType: 'page_stay',
      data: { path: currentPath, duration },
      timestamp: Date.now(),
    });
    originPushState.apply(this, args);
    enterTime = Date.now();
    currentPath = window.location.pathname;
  };
}

点击热图数据采集：

sdk/plugins/behavior/click-heatmap.ts
function initClickHeatmap(report: (data: ReportData) => void): void {
  document.addEventListener('click', (event: MouseEvent) => {
    const target = event.target as HTMLElement;
    if (!target) return;

    report({
      type: 'behavior',
      subType: 'click',
      data: {
        // 页面坐标（绝对位置，用于热力图渲染）
        pageX: event.pageX,
        pageY: event.pageY,
        // 视口坐标
        clientX: event.clientX,
        clientY: event.clientY,
        // 页面总高度（用于计算点击位置百分比）
        pageHeight: document.documentElement.scrollHeight,
        // 元素标识
        tagName: target.tagName.toLowerCase(),
        className: target.className,
        id: target.id,
        text: (target.textContent ?? '').trim().slice(0, 50),
        // data-track-id 作为业务标识
        trackId: target.dataset?.trackId ?? '',
        // XPath 唯一路径
        xpath: getXPath(target),
      },
      timestamp: Date.now(),
    });
  }, true);
}

/** 生成元素的简化 XPath */
function getXPath(element: HTMLElement): string {
  const paths: string[] = [];
  let current: HTMLElement | null = element;

  while (current && current !== document.body) {
    const tagName = current.tagName.toLowerCase();
    const parent = current.parentElement;
    if (parent) {
      const siblings = Array.from(parent.children).filter(
        (child) => child.tagName === current!.tagName
      );
      const index = siblings.indexOf(current) + 1;
      paths.unshift(siblings.length > 1 ? `${tagName}[${index}]` : tagName);
    } else {
      paths.unshift(tagName);
    }
    current = parent;
  }

  return '/body/' + paths.join('/');
}

用户路径追踪：

sdk/plugins/behavior/user-path.ts
interface PathNode {
  path: string;
  title: string;
  enterTime: number;
  leaveTime?: number;
  duration?: number;
}

class UserPathTracker {
  private pathStack: PathNode[] = [];
  private maxPathLength = 20; // 最多记录最近 20 个页面

  enter(path: string, title: string): void {
    // 先结束上一个页面
    const last = this.pathStack[this.pathStack.length - 1];
    if (last && !last.leaveTime) {
      last.leaveTime = Date.now();
      last.duration = last.leaveTime - last.enterTime;
    }

    this.pathStack.push({
      path,
      title,
      enterTime: Date.now(),
    });

    // 超过最大长度时移除最早的记录
    if (this.pathStack.length > this.maxPathLength) {
      this.pathStack.shift();
    }
  }

  /** 获取完整用户路径，用于错误上下文或离开时上报 */
  getPath(): PathNode[] {
    return [...this.pathStack];
  }
}

3.8 数据上报策略

数据上报是连接客户端采集和服务端处理的桥梁，需要同时兼顾可靠性和性能。

上报方式对比：

上报方式	跨域	卸载可靠性	数据量	适用场景
`sendBeacon`	支持	高	64KB	批量上报、页面卸载
`fetch(keepalive)`	需 CORS	中	64KB	大数据、需要响应
`img` 标签	天然跨域	低	URL 长度	简单指标、强兼容
`XMLHttpRequest`	需 CORS	低	无限制	降级方案

完整的上报模块实现：

sdk/reporter.ts
class Reporter {
  private buffer: ReportData[] = [];
  private config: Required<MonitorConfig>;
  private timer: ReturnType<typeof setInterval> | null = null;
  private retryKey = '__monitor_retry_queue__';

  constructor(config: Required<MonitorConfig>) {
    this.config = config;
    this.startTimer();
    this.bindPageHide();
    this.retryFailedData();
  }

  /** 写入缓冲队列 */
  add(data: ReportData): void {
    // 采样判断
    if (Math.random() > this.config.sampleRate) return;

    // beforeSend 钩子
    const processed = this.config.beforeSend?.(data) ?? data;
    if (!processed) return;

    this.buffer.push(processed);

    // 缓冲区满则立即上报
    if (this.buffer.length >= this.config.maxBufferSize) {
      this.flush();
    }
  }

  /** 执行批量上报 */
  private flush(): void {
    if (this.buffer.length === 0) return;

    const batch = [...this.buffer];
    this.buffer = [];

    this.send(batch);
  }

  /** 多级降级发送 */
  private send(data: ReportData[]): void {
    const payload = JSON.stringify(data);
    const blob = new Blob([payload], { type: 'application/json' });

    // 优先 sendBeacon
    if (navigator.sendBeacon) {
      const success = navigator.sendBeacon(this.config.reportUrl, blob);
      if (success) return;
    }

    // 降级 fetch + keepalive
    fetch(this.config.reportUrl, {
      method: 'POST',
      body: blob,
      keepalive: true,
    }).catch(() => {
      // 再降级：存入 localStorage 等待重试
      this.saveToRetryQueue(data);
    });
  }

  /** 离线缓存，下次访问时重试 */
  private saveToRetryQueue(data: ReportData[]): void {
    try {
      const existing = JSON.parse(
        localStorage.getItem(this.retryKey) ?? '[]'
      ) as ReportData[];
      // 限制缓存最多 100 条，防止撑爆 localStorage
      const merged = [...existing, ...data].slice(-100);
      localStorage.setItem(this.retryKey, JSON.stringify(merged));
    } catch {
      // localStorage 不可用时静默失败
    }
  }

  /** 重试之前失败的数据 */
  private retryFailedData(): void {
    try {
      const data = JSON.parse(
        localStorage.getItem(this.retryKey) ?? '[]'
      ) as ReportData[];
      if (data.length > 0) {
        localStorage.removeItem(this.retryKey);
        // 使用 requestIdleCallback 在空闲时重试，不影响页面加载
        const retry = (): void => this.send(data);
        if ('requestIdleCallback' in window) {
          requestIdleCallback(retry);
        } else {
          setTimeout(retry, 2000);
        }
      }
    } catch {
      // 静默失败
    }
  }

  /** 定时上报 */
  private startTimer(): void {
    this.timer = setInterval(() => this.flush(), this.config.flushInterval);
  }

  /** 页面卸载时强制上报 */
  private bindPageHide(): void {
    document.addEventListener('visibilitychange', () => {
      if (document.visibilityState === 'hidden') {
        this.flush();
      }
    });
  }

  destroy(): void {
    if (this.timer) clearInterval(this.timer);
    this.flush();
  }
}

上报策略三要素

面试中回答上报策略时，抓住三个关键词即可：批量合并（减少请求数）、多级降级（sendBeacon -> fetch -> localStorage）、页面卸载兜底（visibilitychange + sendBeacon）。

采样率控制：

sdk/sampler.ts
interface SamplingConfig {
  /** 全局采样率 0~1 */
  globalRate: number;
  /** 按事件类型独立设置采样率 */
  eventRates?: Record<string, number>;
  /** 强制 100% 上报的事件类型 */
  forceReportTypes?: string[];
}

function shouldReport(
  subType: string,
  config: SamplingConfig
): boolean {
  // 关键事件强制上报
  if (config.forceReportTypes?.includes(subType)) {
    return true;
  }

  // 优先使用事件级采样率，否则用全局采样率
  const rate = config.eventRates?.[subType] ?? config.globalRate;
  return Math.random() < rate;
}

// 使用示例
const samplingConfig: SamplingConfig = {
  globalRate: 0.1,           // 默认 10% 采样
  eventRates: {
    js_error: 1.0,           // 错误 100% 上报
    promise_error: 1.0,
    resource_error: 1.0,
    lcp: 0.5,                // 性能指标 50% 采样
    page_view: 0.3,          // PV 30% 采样
    click: 0.01,             // 点击 1% 采样
  },
  forceReportTypes: ['js_error', 'promise_error', 'react_error'],
};

四、关键技术实现

4.1 Monitor SDK 完整实现

sdk/monitor.ts
class MonitorSDK {
  private config: Required<MonitorConfig>;
  private reporter: Reporter;
  private plugins: MonitorPlugin[] = [];
  private userId = '';
  private sessionId: string;

  constructor(config: MonitorConfig) {
    this.config = {
      sampleRate: 1,
      maxBufferSize: 10,
      flushInterval: 5000,
      plugins: [],
      beforeSend: (data: ReportData) => data,
      ...config,
    };

    this.sessionId = this.generateSessionId();
    this.reporter = new Reporter(this.config);

    // 安装插件
    this.config.plugins.forEach((plugin) => this.use(plugin));

    // 挂载到全局，便于其他模块访问
    (window as unknown as Record<string, unknown>).__MONITOR__ = this;
  }

  /** 注册插件 */
  use(plugin: MonitorPlugin): void {
    if (this.plugins.some((p) => p.name === plugin.name)) {
      console.warn(`[Monitor] Plugin "${plugin.name}" already registered`);
      return;
    }
    this.plugins.push(plugin);
    plugin.setup(this.createContext());
  }

  /** 创建插件上下文 */
  private createContext(): MonitorContext {
    return {
      report: (data: ReportData) => this.report(data),
      getConfig: () => ({ ...this.config }),
      getCommonFields: () => ({
        appId: this.config.appId,
        userId: this.userId,
        sessionId: this.sessionId,
      }),
    };
  }

  /** 上报数据（供插件调用） */
  report(data: ReportData): void {
    const enrichedData: ReportData = {
      ...data,
      appId: this.config.appId,
      userId: this.userId,
      sessionId: this.sessionId,
      pageUrl: window.location.href,
      userAgent: navigator.userAgent,
    };

    this.reporter.add(enrichedData);
  }

  /** 设置用户标识 */
  setUser(userId: string): void {
    this.userId = userId;
  }

  /** 手动上报自定义事件 */
  trackEvent(eventName: string, data?: Record<string, unknown>): void {
    this.report({
      type: 'custom',
      subType: eventName,
      data: data ?? {},
      timestamp: Date.now(),
    });
  }

  private generateSessionId(): string {
    return `${Date.now()}-${Math.random().toString(36).slice(2, 10)}`;
  }

  destroy(): void {
    this.plugins.forEach((p) => p.destroy?.());
    this.reporter.destroy();
  }
}

4.2 ErrorPlugin 完整实现

sdk/plugins/error-plugin.ts
const ErrorPlugin: MonitorPlugin = {
  name: 'error',

  setup(ctx: MonitorContext): void {
    // 1. JS 运行时错误
    window.onerror = (
      message: string | Event,
      source?: string,
      lineno?: number,
      colno?: number,
      error?: Error
    ): boolean => {
      ctx.report({
        type: 'error',
        subType: 'js_error',
        data: {
          message: typeof message === 'string' ? message : message.type,
          source: source ?? '',
          lineno: lineno ?? 0,
          colno: colno ?? 0,
          stack: error?.stack ?? '',
        },
        timestamp: Date.now(),
      });
      return true;
    };

    // 2. 未捕获的 Promise 异常
    window.addEventListener('unhandledrejection', (event: PromiseRejectionEvent) => {
      const reason = event.reason;
      ctx.report({
        type: 'error',
        subType: 'promise_error',
        data: {
          message: reason instanceof Error ? reason.message : String(reason),
          stack: reason instanceof Error ? reason.stack ?? '' : '',
        },
        timestamp: Date.now(),
      });
    });

    // 3. 资源加载错误（必须在捕获阶段监听）
    window.addEventListener('error', (event: ErrorEvent | Event) => {
      const target = event.target as HTMLElement;
      if (target && (
        target instanceof HTMLScriptElement ||
        target instanceof HTMLLinkElement ||
        target instanceof HTMLImageElement
      )) {
        ctx.report({
          type: 'error',
          subType: 'resource_error',
          data: {
            tagName: target.tagName.toLowerCase(),
            src: (target as HTMLImageElement).src
              || (target as HTMLLinkElement).href
              || '',
          },
          timestamp: Date.now(),
        });
      }
    }, true);

    // 4. 拦截 fetch 监控接口异常
    const originalFetch = window.fetch;
    window.fetch = async (
      input: RequestInfo | URL,
      init?: RequestInit
    ): Promise<Response> => {
      const startTime = Date.now();
      const url = typeof input === 'string'
        ? input
        : input instanceof URL ? input.href : input.url;

      try {
        const response = await originalFetch(input, init);
        const duration = Date.now() - startTime;

        if (!response.ok) {
          ctx.report({
            type: 'error',
            subType: 'api_error',
            data: { url, status: response.status, duration },
            timestamp: Date.now(),
          });
        }

        // 同时上报接口性能
        ctx.report({
          type: 'perf',
          subType: 'api_perf',
          data: { url, status: response.status, duration },
          timestamp: Date.now(),
        });

        return response;
      } catch (error) {
        ctx.report({
          type: 'error',
          subType: 'api_error',
          data: {
            url,
            status: 0,
            message: error instanceof Error ? error.message : 'Network Error',
            duration: Date.now() - startTime,
          },
          timestamp: Date.now(),
        });
        throw error;
      }
    };
  },
};

4.3 PerformancePlugin 完整实现

sdk/plugins/perf-plugin.ts
const PerformancePlugin: MonitorPlugin = {
  name: 'performance',

  setup(ctx: MonitorContext): void {
    // 1. 页面加载完成后采集 Navigation Timing
    const collectOnLoad = (): void => {
      // 确保 loadEventEnd 有值
      setTimeout(() => {
        const timing = collectNavigationTiming();
        if (timing) {
          ctx.report({
            type: 'perf',
            subType: 'navigation',
            data: timing,
            timestamp: Date.now(),
          });
        }
      }, 0);
    };

    if (document.readyState === 'complete') {
      collectOnLoad();
    } else {
      window.addEventListener('load', collectOnLoad);
    }

    // 2. Web Vitals
    initWebVitals(ctx.report.bind(ctx));

    // 3. 长任务检测
    observeLongTasks(ctx.report.bind(ctx));

    // 4. 资源加载瀑布（只上报慢资源）
    collectResourceTiming();
  },
};

4.4 使用示例

app/main.ts
// 一行代码完成接入
const monitor = new MonitorSDK({
  appId: 'my-app',
  reportUrl: 'https://monitor.example.com/api/report',
  sampleRate: 1.0,
  maxBufferSize: 20,
  flushInterval: 3000,
  plugins: [
    ErrorPlugin,
    PerformancePlugin,
    BehaviorPlugin,
  ],
  beforeSend(data) {
    // 数据脱敏：移除可能包含敏感信息的字段
    if (data.data.url) {
      const url = new URL(data.data.url as string, window.location.origin);
      url.searchParams.delete('token');
      url.searchParams.delete('password');
      data.data.url = url.toString();
    }
    return data;
  },
});

// 用户登录后设置用户标识
monitor.setUser('user_123');

// 手动上报自定义事件
monitor.trackEvent('purchase', {
  productId: 'prod_456',
  amount: 99.9,
  currency: 'CNY',
});

五、性能优化

5.1 SDK 体积优化

策略	说明	预期效果
按需加载插件	通过 `import()` 动态加载非核心插件	Core < 3KB gzip
Tree Shaking	ESM 格式发布，未使用的插件自动移除	减少 30-50% 体积
替换 web-vitals	手写轻量版 Web Vitals 采集	省去 ~2KB 依赖
压缩混淆	Terser 压缩 + 移除 console/debug 代码	减少 40-60% 体积

按需加载插件示例
// 核心包只包含 ErrorPlugin
import { MonitorSDK, ErrorPlugin } from '@my/monitor';

const monitor = new MonitorSDK({
  appId: 'my-app',
  reportUrl: '/api/report',
  plugins: [ErrorPlugin],
});

// 性能插件按需异步加载
if (window.PerformanceObserver) {
  import('@my/monitor/plugins/performance').then(({ PerformancePlugin }) => {
    monitor.use(PerformancePlugin);
  });
}

5.2 上报性能优化

策略	实现方式
合并请求	缓冲队列 + 批量上报，减少 HTTP 请求数
数据压缩	使用 CompressionStream API 进行 gzip 压缩
requestIdleCallback	非关键数据在浏览器空闲时采集和上报
Web Worker	将数据序列化、压缩等 CPU 密集操作放到 Worker 线程

使用 requestIdleCallback 延迟非关键采集
function scheduleIdleTask(task: () => void): void {
  if ('requestIdleCallback' in window) {
    requestIdleCallback(task, { timeout: 3000 });
  } else {
    // 降级方案
    setTimeout(task, 100);
  }
}

// 非关键数据在空闲时采集
scheduleIdleTask(() => {
  collectResourceTiming();
});

scheduleIdleTask(() => {
  collectUserEnvironment(); // 设备信息、网络类型等
});

5.3 Source Map 还原

生产环境代码压缩混淆后，错误堆栈中的行列号无法直接对应源码。需要通过 Source Map 还原：

server/sourcemap-service.ts
import { SourceMapConsumer } from 'source-map';
import * as fs from 'fs';

interface OriginalPosition {
  source: string | null;
  line: number | null;
  column: number | null;
  name: string | null;
}

async function resolveErrorPosition(
  sourceMapPath: string,
  line: number,
  column: number
): Promise<OriginalPosition> {
  const rawSourceMap = JSON.parse(
    fs.readFileSync(sourceMapPath, 'utf-8')
  );

  const consumer = await new SourceMapConsumer(rawSourceMap);

  const position = consumer.originalPositionFor({ line, column });

  consumer.destroy();
  return position;
  // { source: 'src/utils/format.ts', line: 42, column: 8, name: 'formatDate' }
}

Source Map 安全

使用 sourcemap: 'hidden' 生成 .map 文件但不在 JS 中添加 sourceMappingURL 注释
Source Map 文件绝不能部署到生产 CDN，应上传到内部错误分析平台
在 CI/CD 流水线中构建完成后自动上传到 Sentry，然后删除本地 .map 文件

六、扩展设计

6.1 用户回放（rrweb）

rrweb 可以记录用户操作的完整回放，在错误排查时还原"案发现场"：

sdk/plugins/replay-plugin.ts
import * as rrweb from 'rrweb';
import type { eventWithTime } from 'rrweb/typings/types';

function initReplayPlugin(report: (data: ReportData) => void): void {
  const events: eventWithTime[] = [];

  // 录制用户操作
  const stopRecord = rrweb.record({
    emit(event) {
      events.push(event);
      // 只保留最近 30 秒的事件，控制内存
      const thirtySecondsAgo = Date.now() - 30_000;
      while (events.length > 0 && events[0].timestamp < thirtySecondsAgo) {
        events.shift();
      }
    },
    // 只录制 DOM 变更和用户交互，不录制鼠标移动
    recordCanvas: false,
    sampling: {
      mousemove: false,
      mouseInteraction: true,
      scroll: 150,
      input: 'last',
    },
  });

  // 发生错误时，将最近 30 秒的回放数据一并上报
  window.addEventListener('error', () => {
    report({
      type: 'replay',
      subType: 'error_replay',
      data: { events: [...events] },
      timestamp: Date.now(),
    });
  });
}

6.2 告警规则引擎

server/alert-engine.ts
interface AlertRule {
  name: string;
  /** 监控的指标类型 */
  metric: string;
  /** 聚合窗口（秒） */
  window: number;
  /** 触发条件 */
  condition: {
    operator: '>' | '<' | '>=' | '<=' | '==';
    threshold: number;
  };
  /** 通知渠道 */
  channels: Array<'dingtalk' | 'feishu' | 'email' | 'sms'>;
  /** 静默期（秒），避免重复告警 */
  silencePeriod: number;
}

// 告警规则示例
const alertRules: AlertRule[] = [
  {
    name: 'JS 错误率过高',
    metric: 'js_error_rate',
    window: 300,           // 5 分钟窗口
    condition: { operator: '>', threshold: 0.01 },  // 错误率 > 1%
    channels: ['dingtalk', 'email'],
    silencePeriod: 1800,   // 30 分钟内不重复告警
  },
  {
    name: 'LCP 超标',
    metric: 'lcp_p75',
    window: 600,           // 10 分钟窗口
    condition: { operator: '>', threshold: 4000 },   // P75 > 4s
    channels: ['feishu'],
    silencePeriod: 3600,
  },
  {
    name: '接口成功率下降',
    metric: 'api_success_rate',
    window: 60,            // 1 分钟窗口
    condition: { operator: '<', threshold: 0.95 },   // 成功率 < 95%
    channels: ['dingtalk', 'sms'],
    silencePeriod: 300,
  },
];

6.3 全链路追踪（TraceId）

通过在请求中注入 TraceId，将前端操作与后端日志关联起来，实现全链路排查：

sdk/plugins/trace-plugin.ts
function generateTraceId(): string {
  // 使用 crypto API 生成高质量随机 ID
  const array = new Uint8Array(16);
  crypto.getRandomValues(array);
  return Array.from(array, (b) => b.toString(16).padStart(2, '0')).join('');
}

// 拦截 fetch，自动注入 TraceId
function injectTraceId(): void {
  const originalFetch = window.fetch;

  window.fetch = async (
    input: RequestInfo | URL,
    init?: RequestInit
  ): Promise<Response> => {
    const traceId = generateTraceId();
    const spanId = generateTraceId().slice(0, 16);

    const headers = new Headers(init?.headers);
    headers.set('X-Trace-Id', traceId);
    headers.set('X-Span-Id', spanId);
    headers.set('X-Request-Start', String(Date.now()));

    return originalFetch(input, { ...init, headers });
  };
}

6.4 A/B 测试集成

sdk/plugins/ab-test-plugin.ts
interface ABTestConfig {
  experimentId: string;
  variant: string;  // 'control' | 'treatment_a' | 'treatment_b'
}

// 将 A/B 测试分组信息注入到所有上报数据中
function initABTestPlugin(
  experiments: ABTestConfig[],
  report: (data: ReportData) => void
): void {
  // 将实验信息作为全局上下文，每条上报数据都会携带
  const experimentContext = experiments.reduce(
    (acc, exp) => {
      acc[`exp_${exp.experimentId}`] = exp.variant;
      return acc;
    },
    {} as Record<string, string>
  );

  // 后续所有上报数据都会自动携带实验分组信息
  // 在数据分析平台可按 variant 分组对比各指标差异
  report({
    type: 'custom',
    subType: 'ab_test_exposure',
    data: experimentContext,
    timestamp: Date.now(),
  });
}

七、服务端架构与存储方案

监控系统的服务端需要处理海量写入、实时聚合、灵活查询三大挑战。下面是主流的技术选型方案。

7.1 整体架构

7.2 接入层

方案	说明	适用规模
Nginx + access_log	用 1x1 GIF 或空响应接收数据，写入 access_log 由后续管道消费	中小规模，简单高效
Node.js / Go 服务	自定义接口，做数据校验、采样、充等	中大规模，需要灵活处理
API Gateway（Kong / APISIX）	统一网关，内置限流、认证、日志转发	大规模，微服务架构
Serverless（Cloudflare Workers / Lambda）	无服务器接入，自动弹性扩缩	弹性需求，流量波动大

server/receiver.ts（Node.js 示例）
import Fastify from 'fastify';
import { Kafka } from 'kafkajs';

const app = Fastify();
const kafka = new Kafka({ brokers: ['kafka:9092'] });
const producer = kafka.producer();

await producer.connect();

app.post('/api/report', async (request, reply) => {
  const events = Array.isArray(request.body) ? request.body : [request.body];

  // 写入 Kafka，异步处理，接口快速返回
  await producer.send({
    topic: 'monitor-events',
    messages: events.map((event) => ({
      key: event.appId,
      value: JSON.stringify({
        ...event,
        serverTime: Date.now(),
        ip: request.ip,
      }),
    })),
  });

  return { code: 0 };
});

app.listen({ port: 3000 });

7.3 消息队列

方案	特点	适用场景
Kafka	高吞吐、持久化、分区消费、生态最完善	首选，大规模监控系统标配
RabbitMQ	灵活路由、延迟队列、轻量	中小规模，需要复杂路由
Redis Stream	轻量、低延迟	小规模，或作为缓冲层
Pulsar	存算分离、多租户	超大规模，多业务线共享

为什么需要消息队列？

削峰填谷：前端流量有明显高峰（早 10 点、午休），MQ 平滑消费压力
解耦：接入层只负责接收，清洗/聚合/存储各自独立消费
重试：消费失败可重新消费，不丢数据
多消费者：同一份数据可以被告警、聚合、搜索等多个消费者并行处理

7.4 数据存储

不同类型的数据需要不同的存储引擎：

数据类型	推荐存储	说明
聚合指标（LCP P75、秒开率、错误率）	InfluxDB / VictoriaMetrics / Prometheus	时序数据库，擅长时间范围聚合查询
明细日志（每条错误/事件的详细信息）	ClickHouse / Elasticsearch	OLAP 引擎，支持多维分析和全文检索
错误详情（错误堆栈、用户路径）	Elasticsearch	全文搜索 + 结构化查询
SourceMap 文件	S3 / OSS	对象存储，按版本号组织
用户回放数据	S3 / OSS + MongoDB	大 JSON 文档，按 sessionId 索引
告警规则/配置	MySQL / PostgreSQL	关系型数据库，事务保证

ClickHouse（推荐用于大规模场景）

ClickHouse 是列式 OLAP 数据库，专为分析场景设计，在监控系统中越来越普及：

-- 建表示例：监控事件宽表
CREATE TABLE monitor_events (
    app_id       String,
    event_type   String,
    sub_type     String,
    timestamp    DateTime64(3),
    user_id      String,
    session_id   String,
    page_url     String,
    -- 性能指标
    metric_value Float64,
    -- 维度字段
    browser      String,
    os           String,
    network_type String,
    country      String,
    -- 错误信息
    error_message String,
    error_stack   String,
    -- JSON 扩展字段
    extra        String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)  -- 按天分区
ORDER BY (app_id, event_type, timestamp)
TTL timestamp + INTERVAL 90 DAY    -- 90 天自动过期
;

-- 查询示例：某应用过去 1 小时的 LCP P75
SELECT
    toStartOfMinute(timestamp) AS minute,
    quantile(0.75)(metric_value) AS lcp_p75
FROM monitor_events
WHERE app_id = 'my-app'
  AND sub_type = 'lcp'
  AND timestamp >= now() - INTERVAL 1 HOUR
GROUP BY minute
ORDER BY minute;

-- 查询示例：秒开率趋势
SELECT
    toStartOfHour(timestamp) AS hour,
    countIf(metric_value <= 1000) / count() AS second_open_rate
FROM monitor_events
WHERE app_id = 'my-app'
  AND sub_type = 'second_open'
  AND timestamp >= now() - INTERVAL 7 DAY
GROUP BY hour
ORDER BY hour;

ClickHouse vs Elasticsearch

对比项	ClickHouse	Elasticsearch
查询类型	聚合分析（SUM/AVG/P99）	全文搜索 + 聚合
写入性能	极高（百万行/秒）	中等（需要倒排索引构建）
存储成本	极低（列式压缩 10:1）	较高（倒排索引+原始数据）
适合场景	指标看板、趋势分析、Top N	错误搜索、日志检索

最佳实践：聚合指标存 ClickHouse，错误明细存 Elasticsearch，各取所长。

时序数据库（InfluxDB / VictoriaMetrics）

适合存储预聚合后的指标数据，用于 Grafana 看板展示：

# InfluxDB Line Protocol 示例
web_vitals,app=my-app,page=/home,browser=chrome lcp=2340,inp=180,cls=0.05 1709712000000000000
web_vitals,app=my-app,page=/detail,browser=safari lcp=3120,inp=250,cls=0.12 1709712000000000000

# 查询：过去 1 小时 LCP P75
SELECT percentile("lcp", 75) FROM "web_vitals"
WHERE time > now() - 1h
GROUP BY time(1m), "page"

7.5 数据清洗与聚合

数据清洗的关键步骤：

步骤	说明
格式校验	检查必填字段、数据类型、时间戳合理性
数据脱敏	移除 URL 中的 token/password 参数
UA 解析	从 UserAgent 提取浏览器、OS、设备信息
IP 定位	通过 IP 解析地理位置（国家、城市）
错误指纹	计算错误 fingerprint，用于聚合去重
维度补全	补充 app 版本、环境（prod/staging）等维度

7.6 可视化与告警

方案	说明
Grafana	对接 ClickHouse/InfluxDB/ES，配置看板和告警规则
自建看板	用 ECharts/AntV 构建定制化数据分析平台
Sentry	开箱即用的错误监控 SaaS，支持私有部署
PagerDuty / OpsGenie	告警分发、值班轮转、升级策略

7.7 主流开源/商业方案对比

方案	类型	特点	适用场景
Sentry	错误监控	错误聚合、SourceMap、回放、Issue 追踪	错误监控首选
GrowingIO / 神策	行为分析	埋点管理、漏斗分析、用户画像	产品分析
Prometheus + Grafana	指标监控	拉取模型、PromQL、告警规则	后端+基础设施
ELK Stack	日志分析	采集→搜索→可视化完整链路	日志检索
OpenTelemetry	可观测性	厂商中立、Traces/Metrics/Logs 统一	全链路追踪
ClickHouse	OLAP	列式存储、超快聚合	自建分析平台

技术选型建议

小团队 / 快速起步：Sentry（错误）+ Google Analytics（行为）+ 云厂商 APM
中等规模：Sentry + 自建 SDK + ClickHouse + Grafana
大规模 / 自建全套：自建 SDK + Kafka + Flink + ClickHouse + ES + Grafana

常见面试问题

Q1: 如何设计一个前端监控 SDK？核心架构是什么？

答案：

前端监控 SDK 的核心架构是 "微内核 + 插件" 模式：

微内核（Core）：只负责生命周期管理、配置管理、缓冲队列、上报通道
插件系统：ErrorPlugin、PerformancePlugin、BehaviorPlugin 等各模块作为独立插件按需注册

// 插件接口
interface MonitorPlugin {
  name: string;
  setup(ctx: MonitorContext): void;
  destroy?(): void;
}

// 使用示例
const monitor = new MonitorSDK({
  appId: 'my-app',
  reportUrl: '/api/report',
  plugins: [ErrorPlugin, PerformancePlugin],
});

设计要点：

插件化：解耦各功能模块，支持 Tree Shaking 和按需加载
缓冲队列 + 批量上报：减少请求次数，定量（10 条）或定时（5s）触发
多级降级上报：sendBeacon -> fetch(keepalive) -> img -> localStorage 离线缓存
采样率控制：全局采样 + 事件级采样 + 关键事件强制 100% 上报
beforeSend 钩子：支持数据脱敏和过滤
零侵入：通过劫持原生 API（onerror、fetch）自动采集，业务代码无需修改

Q2: 前端如何做全链路错误监控？需要覆盖哪些错误类型？

答案：

全链路错误监控需要组合使用多种捕获方式，覆盖 5 种错误类型：

错误类型	捕获方式	关键点
JS 运行时错误	`window.onerror`	注意跨域 Script Error
Promise 异常	`unhandledrejection`	捕获未 catch 的异步错误
资源加载错误	`addEventListener('error', ..., true)`	必须在捕获阶段监听
框架组件错误	React ErrorBoundary / Vue errorHandler	捕获渲染阶段异常
接口异常	拦截 fetch / XMLHttpRequest	监控非 2xx 状态码和网络错误

需要注意的问题：

跨域 Script Error：需要给 <script> 添加 crossorigin="anonymous"，CDN 响应 Access-Control-Allow-Origin
错误聚合：通过 message + stack 计算错误指纹，避免同一错误重复告警
Source Map：生产环境需要上传 Source Map 到 Sentry 等平台，还原压缩代码的错误位置

Q3: sendBeacon 和 fetch 有什么区别？为什么监控上报推荐 sendBeacon？

答案：

特性	sendBeacon	fetch
页面卸载后是否继续	浏览器保证发送	可能被取消（需 keepalive）
是否阻塞页面关闭	不阻塞	不阻塞
能否获取响应	不能	能
数据量限制	通常 64KB	keepalive 模式 64KB
返回值	`boolean`（是否入队成功）	`Promise<Response>`

推荐 sendBeacon 的三个核心原因：

页面卸载可靠性：用户关闭页面时，普通请求可能被取消。sendBeacon 将请求放入浏览器内部队列，即使页面卸载也能完成发送
不影响用户体验：请求优先级低，不会与业务请求争抢带宽
API 简洁：navigator.sendBeacon(url, data) 一行代码即可

// 最佳实践：visibilitychange + sendBeacon
document.addEventListener('visibilitychange', () => {
  if (document.visibilityState === 'hidden') {
    const blob = new Blob([JSON.stringify(bufferedData)], {
      type: 'application/json',
    });
    navigator.sendBeacon('/api/report', blob);
  }
});

Q4: 如何设计数据上报的采样策略？

答案：

采样策略需要在数据完整性和服务端成本之间找到平衡。核心原则是：关键数据不采样，高频数据按比例采样。

三层采样策略：

// 1. 全局采样率：控制整体数据量
const globalRate = 0.1;  // 10%

// 2. 事件级采样率：重要事件更高采样
const eventRates: Record<string, number> = {
  js_error: 1.0,     // 错误 100% 上报
  api_error: 1.0,    // 接口异常 100%
  lcp: 0.5,          // 性能 50%
  page_view: 0.3,    // PV 30%
  click: 0.01,       // 点击 1%
};

// 3. 强制上报名单：某些关键事件无论采样率如何都必须上报
const forceReportTypes = ['js_error', 'promise_error', 'fatal_error'];

此外，还可以引入用户级采样：通过 userId hash 决定某个用户是否被采样。这样同一用户的所有数据要么全部上报，要么全部不上报，有利于用户级别的问题排查。

Q5: 如何保证监控 SDK 不影响业务性能？

答案：

从以下 5 个维度控制 SDK 的性能影响：

维度	策略
体积	SDK Core < 3KB gzip，插件按需加载
初始化	异步加载 SDK，不阻塞首屏渲染
采集时机	非关键数据使用 `requestIdleCallback` 延迟采集
计算开销	数据序列化、压缩放到 Web Worker 中执行
网络开销	批量上报、采样控制、sendBeacon 低优先级发送

// 异步加载 SDK，不阻塞页面
// <script async src="monitor-sdk.js"></script>

// 非关键数据延迟采集
if ('requestIdleCallback' in window) {
  requestIdleCallback(() => {
    collectDeviceInfo();
    collectResourceTiming();
  }, { timeout: 3000 });
}

// CPU 密集操作放到 Worker
const worker = new Worker('monitor-worker.js');
worker.postMessage({ type: 'compress', data: reportData });

Q6: 用户操作回放（rrweb）的原理是什么？如何控制数据量？

答案：

rrweb 的核心原理是：

首次全量快照：将整个 DOM 树序列化为 JSON，记录初始状态
增量变更记录：通过 MutationObserver 监听 DOM 变更，只记录差异（增删改节点、属性变化、文本变化）
用户交互记录：记录鼠标位置、点击、滚动、输入等操作
回放：按时间戳顺序还原所有快照和变更

控制数据量的策略：

rrweb.record({
  emit(event) { /* ... */ },
  // 1. 选择性录制
  recordCanvas: false,        // 不录制 Canvas
  sampling: {
    mousemove: false,          // 不录制鼠标移动
    mouseInteraction: true,    // 只录制点击等交互
    scroll: 150,               // 滚动 150ms 采样一次
    input: 'last',             // 输入只记录最终值
  },
  // 2. 屏蔽敏感内容
  maskAllInputs: true,         // 所有输入框内容脱敏
  blockClass: 'no-record',     // 指定 class 的元素不录制
});

// 3. 只保留最近 N 秒的事件（滑动窗口）
// 发生错误时才上报最近 30 秒的回放数据

Q7: 如何实现前端和后端的全链路追踪？

答案：

全链路追踪的核心是 TraceId 透传。在前端发起请求时生成唯一的 TraceId，通过 HTTP Header 传递给后端，后端在日志中记录同一个 TraceId。

实现步骤：

前端生成 TraceId：使用 crypto.getRandomValues() 生成 128 位随机 ID
请求注入：拦截 fetch/XMLHttpRequest，在 Header 中注入 X-Trace-Id
后端透传：后端服务间调用时继续传递同一 TraceId
日志关联：前端错误上报携带 TraceId，在后端日志系统中搜索同一 ID 即可串联全链路

// 前端自动注入
const originalFetch = window.fetch;
window.fetch = async (input, init) => {
  const traceId = generateTraceId();
  const headers = new Headers(init?.headers);
  headers.set('X-Trace-Id', traceId);

  // 同时将 traceId 记录到监控数据中
  currentTraceId = traceId;

  return originalFetch(input, { ...init, headers });
};

这种方式兼容 OpenTelemetry 等标准化追踪方案，可以与后端的 Jaeger、Zipkin 等分布式追踪系统无缝对接。

Q8: 前端监控中如何做错误聚合去重？

答案：

错误聚合的核心是错误指纹（Fingerprint） 计算。将相同根因的错误归类到一起，避免同一个 Bug 触发成千上万条告警。

function generateErrorFingerprint(error: {
  message: string;
  stack?: string;
  source?: string;
}): string {
  // 1. 从 stack 中提取关键帧（第一个有效的调用栈位置）
  const stackLines = (error.stack ?? '').split('\n');
  const keyFrame = stackLines.find(
    (line) => line.includes('.js:') || line.includes('.ts:')
  ) ?? '';

  // 2. 移除动态内容（行号变化、URL 参数等）
  const normalizedFrame = keyFrame
    .replace(/:\d+:\d+/g, '')    // 移除行列号
    .replace(/\?.*$/g, '')        // 移除 URL 参数
    .trim();

  // 3. 组合 message + 关键帧生成指纹
  const raw = `${error.message}|${normalizedFrame}`;

  // 4. 使用 hash 算法生成固定长度指纹
  return hashString(raw);
}

function hashString(str: string): string {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    const char = str.charCodeAt(i);
    hash = ((hash << 5) - hash) + char;
    hash = hash & hash; // 转为 32 位整数
  }
  return Math.abs(hash).toString(36);
}

服务端聚合逻辑：

相同指纹的错误归为一个 Issue
每个 Issue 记录首次出现时间、最近出现时间、影响用户数、出现次数
新 Issue 触发告警，已知 Issue 只更新计数（除非频率突增）

一、需求分析​

1.1 功能需求​

1.2 非功能需求​

二、整体架构​

2.1 监控链路全景​

2.2 数据流转时序​

三、核心模块设计​

3.1 SDK 核心架构 — 插件化设计​

3.2 错误监控模块​

3.3 性能监控模块​

3.4 页面体验指标​

页面秒开率​

首屏时间​

白屏检测​

页面崩溃检测​

FMP（首次有意义绘制）​

TTI（可交互时间）​

3.5 视频/直播监控指标​

视频秒开 & 首帧时间​

卡顿检测​

黑屏检测​

音画同步检测​

码率切换与播放失败​

3.6 补充通用指标​

接口成功率与耗时分布​

SPA 路由切换耗时​

帧率（FPS）监控​

内存占用监控​

3.7 用户行为追踪模块​

3.8 数据上报策略​

四、关键技术实现​

4.1 Monitor SDK 完整实现​

4.2 ErrorPlugin 完整实现​

4.3 PerformancePlugin 完整实现​

4.4 使用示例​

五、性能优化​

5.1 SDK 体积优化​

5.2 上报性能优化​

5.3 Source Map 还原​

六、扩展设计​

6.1 用户回放（rrweb）​

6.2 告警规则引擎​

6.3 全链路追踪（TraceId）​

6.4 A/B 测试集成​

七、服务端架构与存储方案​

7.1 整体架构​

7.2 接入层​

7.3 消息队列​

7.4 数据存储​

ClickHouse（推荐用于大规模场景）​

时序数据库（InfluxDB / VictoriaMetrics）​

7.5 数据清洗与聚合​

7.6 可视化与告警​

7.7 主流开源/商业方案对比​

常见面试问题​

Q1: 如何设计一个前端监控 SDK？核心架构是什么？​

Q2: 前端如何做全链路错误监控？需要覆盖哪些错误类型？​

Q3: sendBeacon 和 fetch 有什么区别？为什么监控上报推荐 sendBeacon？​

Q4: 如何设计数据上报的采样策略？​

Q5: 如何保证监控 SDK 不影响业务性能？​

Q6: 用户操作回放（rrweb）的原理是什么？如何控制数据量？​

Q7: 如何实现前端和后端的全链路追踪？​

Q8: 前端监控中如何做错误聚合去重？​

相关链接​