# Design a Web Crawler System

## Question

How would you design an efficient web crawler system in Python? What is Scrapy's core architecture?

## Answer
### Scrapy Architecture

Scrapy is driven by a central engine that moves requests and responses between a handful of cooperating components:

- **Engine**: coordinates the data flow between all other components.
- **Scheduler**: queues and deduplicates the requests it receives from the engine.
- **Downloader**: fetches pages and returns responses to the engine.
- **Spiders**: parse responses, yield scraped items, and yield follow-up requests.
- **Item Pipeline**: cleans, validates, and persists the yielded items.
- **Downloader / Spider middlewares**: hooks around the downloader and spiders for rewriting requests and responses (User-Agent rotation, proxies, retries, etc.).
### Scrapy Spider Example

`spiders/product_spider.py`:
```python
import scrapy

from items import ProductItem


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products?page=1"]

    # Spider-specific overrides of the project-wide settings
    custom_settings = {
        "CONCURRENT_REQUESTS": 16,
        "DOWNLOAD_DELAY": 0.5,
        "RETRY_TIMES": 3,
    }

    def parse(self, response):
        # Parse the product listing page
        for card in response.css("div.product-card"):
            item = ProductItem()
            item["name"] = card.css("h3::text").get()
            item["price"] = card.css("span.price::text").get()
            detail_url = card.css("a::attr(href)").get()
            # Follow the link to the product detail page
            yield response.follow(detail_url, self.parse_detail, meta={"item": item})

        # Pagination
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_detail(self, response):
        item = response.meta["item"]
        item["description"] = response.css("div.detail::text").get()
        yield item
```
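
A minimal `items.py` to go with the spider above (a sketch; the field set is inferred from the fields the spider populates):

```python
# items.py -- field container assumed by the spider above
import scrapy


class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    description = scrapy.Field()
```

The spider can then be run with `scrapy crawl products -o products.json`.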
### Asynchronous Crawler (aiohttp)

`async_crawler.py`:
```python
import asyncio
from urllib.parse import urljoin

import aiohttp
from bs4 import BeautifulSoup


class AsyncCrawler:
    def __init__(self, max_concurrent: int = 10):
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.visited: set[str] = set()
        self.results: list[dict] = []

    async def fetch(self, session: aiohttp.ClientSession, url: str) -> str | None:
        # The semaphore caps how many requests are in flight at once
        async with self.semaphore:
            try:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                    if resp.status == 200:
                        return await resp.text()
                    return None
            except (aiohttp.ClientError, asyncio.TimeoutError):
                return None

    async def crawl(self, start_url: str, max_pages: int = 100):
        async with aiohttp.ClientSession() as session:
            queue: asyncio.Queue[str] = asyncio.Queue()
            await queue.put(start_url)
            while not queue.empty() and len(self.visited) < max_pages:
                # Drain the current frontier, skipping URLs we have already seen
                batch: list[str] = []
                while not queue.empty() and len(self.visited) + len(batch) < max_pages:
                    url = await queue.get()
                    if url not in self.visited and url not in batch:
                        batch.append(url)
                self.visited.update(batch)
                # Fetch the whole batch concurrently, bounded by the semaphore
                pages = await asyncio.gather(*(self.fetch(session, u) for u in batch))
                for url, html in zip(batch, pages):
                    if html:
                        self.parse(html, url, queue)

    def parse(self, html: str, base_url: str, queue: asyncio.Queue):
        # Extract links with BeautifulSoup/lxml and enqueue unvisited ones
        soup = BeautifulSoup(html, "lxml")
        for link in soup.find_all("a", href=True):
            abs_url = urljoin(base_url, link["href"])
            if abs_url not in self.visited:
                queue.put_nowait(abs_url)
```
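
A usage sketch (the start URL is a placeholder):

```python
if __name__ == "__main__":
    crawler = AsyncCrawler(max_concurrent=20)
    asyncio.run(crawler.crawl("https://example.com", max_pages=50))
    print(f"Visited {len(crawler.visited)} pages")
```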
### Dealing with Anti-Scraping Measures

`middlewares.py`:
```python
import random


class RandomUserAgentMiddleware:
    """Rotate the User-Agent header on every outgoing request."""

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)


class ProxyMiddleware:
    """Route each request through a proxy taken from a proxy pool."""

    def process_request(self, request, spider):
        proxy = get_proxy_from_pool()  # fetch from the proxy pool; placeholder for your pool client
        request.meta["proxy"] = f"http://{proxy}"
```
### Common Interview Questions

**Q1: How do you handle pages rendered by JavaScript?**

Answer:

- Splash: a lightweight browser-rendering service, integrated via scrapy-splash
- Playwright/Selenium: drive a headless browser that executes the JS (see the sketch below)
- Reverse-engineer the API: inspect the page's XHR/Fetch requests and call the backend API directly (preferred)
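
A minimal sketch of the headless-browser option using Playwright's async API (assumes the `playwright` package and a Chromium build are installed; the `networkidle` wait condition is one reasonable choice):

```python
import asyncio

from playwright.async_api import async_playwright


async def render(url: str) -> str:
    """Load a page in headless Chromium and return the fully rendered HTML."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")  # wait for XHR/Fetch traffic to settle
        html = await page.content()
        await browser.close()
        return html


# Example: html = asyncio.run(render("https://example.com/products"))
```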
**Q2: What deduplication strategies can a crawler use?**

Answer:

- URL deduplication: a Bloom filter (small memory footprint, tolerates a small false-positive rate; see the sketch below)
- Content deduplication: SimHash / MinHash to measure page similarity
- Scrapy built-in: `RFPDupeFilter`, which deduplicates on request (URL) fingerprints
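
A toy in-memory Bloom filter to illustrate the URL-deduplication idea (the bit-array size and hash count are illustrative; a production crawler would usually share a Redis-backed filter across workers):

```python
import hashlib


class BloomFilter:
    """Approximate membership test: no false negatives, a small rate of false positives."""

    def __init__(self, size_bits: int = 1 << 24, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k bit positions from salted SHA-1 digests of the item
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


# Usage: seen = BloomFilter(); if url not in seen: crawl it, then seen.add(url)
```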
**Q3: How do you implement a distributed crawler?**

Answer:

- Scrapy-Redis: use Redis as a shared scheduling queue consumed by multiple workers (see the sketch below)
- URL sharding: hash URLs by domain and assign each shard to a different worker
- Shared deduplication: a Redis-backed Bloom filter
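
A sketch of the Scrapy-Redis wiring, assuming the `scrapy-redis` package and a reachable Redis instance (the Redis URL and key name are placeholders):

```python
# settings.py -- move scheduling and dedup into shared Redis structures
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                 # keep the queue and fingerprints across restarts
REDIS_URL = "redis://localhost:6379/0"   # placeholder address


# spiders/distributed_spider.py -- workers block on a Redis list instead of start_urls
from scrapy_redis.spiders import RedisSpider


class DistributedProductSpider(RedisSpider):
    name = "products_distributed"
    redis_key = "products:start_urls"    # seed with: LPUSH products:start_urls <url>

    def parse(self, response):
        ...
```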
**Q4: How do you avoid getting your IP banned?**

Answer:

- Lower the request rate (`DOWNLOAD_DELAY`); see the settings sketch below
- Randomize the User-Agent
- Rotate through an IP proxy pool
- Maintain a cookie pool
- Respect `robots.txt`
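
A Scrapy settings sketch that combines these knobs (the specific values are illustrative, not recommendations):

```python
# settings.py -- politeness / anti-ban related settings (values are illustrative)
DOWNLOAD_DELAY = 1.0                    # base delay between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True         # jitter the delay between 0.5x and 1.5x
AUTOTHROTTLE_ENABLED = True             # adapt the delay to observed server latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
CONCURRENT_REQUESTS_PER_DOMAIN = 4
ROBOTSTXT_OBEY = True                   # honour robots.txt
COOKIES_ENABLED = True                  # reuse session cookies (pair with a cookie pool if needed)
```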